Abstract

In nonparametric classification and regression problems, regularized
kernel methods, in particular support vector machines, attract
much attention in theoretical and in applied statistics. In an abstract
sense, regularized kernel methods (simply called SVMs here) can be
seen as regularized M-estimators for a parameter in a (typically infinite
dimensional) reproducing kernel Hilbert space. For smooth loss functions
$L$, it is shown that the difference between the estimator, i.e. the
empirical SVM $f_{L,D_n,\lambda_{D_n}}$, and the theoretical SVM
$f_{L,P,\lambda_0}$ is asymptotically normal with rate $\sqrt{n}$. That is,
$\sqrt{n}(f_{L,D_n,\lambda_{D_n}} - f_{L,P,\lambda_0})$ converges weakly
to a Gaussian process in the reproducing kernel Hilbert space. As common
in real applications, the choice of the regularization parameter
$\lambda_{D_n}$ in $f_{L,D_n,\lambda_{D_n}}$ may depend on the data. The
proof is done by an application of the functional delta-method and by
showing that the SVM-functional $P \mapsto f_{L,P,\lambda}$ is suitably
Hadamard-differentiable.

Keywords: Nonparametric regression, support vector machines, asymptotic
normality, Hadamard-differentiability, functional delta-method

1 Introduction



One of the most important tasks in statistics is the estimation of the
influence of an input variable $X$ on an output variable $Y$. On the basis of a
finite data set $(x_1, y_1), \dots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$,
the goal is to find an "optimal" predictor $f : \mathcal{X} \to \mathcal{Y}$ which
makes a prediction $f(x)$ for an unobserved $y$. In case of a finite space
$\mathcal{Y}$, this is called classification and, in case of an infinite
space $\mathcal{Y} \subset \mathbb{R}$, this is called regression. Often, a signal
plus noise relationship $y = f_0(x) + \varepsilon$ is assumed and the task is to
estimate the unknown regression function $f_0$. In parametric statistics, it is
assumed that $f_0$ is contained in a known finite-dimensional function space.
This assumption is dropped or, at least, considerably weakened in nonparametric
statistics. In nonparametric classification and regression problems,
regularized kernel methods, in particular support vector machines, have
recently attracted much attention in theoretical and in applied statistics;
see e.g. the comprehensive books Vapnik (1998), Schölkopf and Smola (2002),
and Steinwart and Christmann (2008) and the references cited therein. For
convenience, a large class of regularized kernel methods for classification
and regression (based on any loss function) is called "support vector machine"
(SVM) in the following, e.g. as in Steinwart and Christmann (2008). That is,
the term "support vector machine" (SVM) is used in a broad sense here whereas,
originally, the term "support vector machine" was coined for the special case
where $\mathcal{Y} = \{-1, 1\}$ (binary classification) and where the loss
function $L$ is the so-called hinge loss.

Typically, the weaker assumptions in nonparametric statistics have to be
compensated by an increase of observations in order to obtain the same
precision of the estimation. Nevertheless, it is well-known that some
nonparametric estimators still are asymptotically normal for the same rate
$\sqrt{n}$ as many parametric estimators. In this article, it is shown that
also support vector machines based on smooth loss functions enjoy an
asymptotic normality property for the rate $\sqrt{n}$. For an i.i.d. sample
$D_n = ((x_1, y_1), \dots, (x_n, y_n))$ from a distribution $P$, the empirical
SVM is a function $f_{L,D_n,\lambda_{D_n}}$ which solves the minimization
problem

$$\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, f(x_i)) + \lambda_{D_n} \|f\|_H^2 , \qquad (1)$$

where $L$ is a loss function and $H$ is a certain space of functions
$f : \mathcal{X} \to \mathbb{R}$, namely a so-called reproducing kernel Hilbert
space. The first term in (1) is the empirical mean of the losses caused by the
predictions $f(x_i)$ and the second term penalizes the complexity of $f$ in
order to avoid overfitting; the regularization parameter $\lambda_{D_n}$ is a
positive real number which is typically chosen in a data-driven way, e.g., by
cross-validation.

Depending on the size of the space $H$, SVMs can be used as a parametric or
a non-parametric method. Choosing a finite-dimensional $H$ leads to a
parametric setting, choosing an infinite-dimensional $H$ leads to a
non-parametric setting. In the parametric setting, asymptotic normality of
support vector machines in the original sense (binary classification using the
hinge loss) has already been investigated: Jiang et al. (2008) derive
asymptotic normality of the estimated prediction error of SVMs with
finite-dimensional $H$. Under some regularity conditions on the distribution
of the data, Koo et al. (2008) show asymptotic normality of the coefficients
of the linear SVM (i.e., $H$ only contains linear functions). In the
following, a general non-parametric setting (covering classification and
regression) is considered but, by going over from parametrics to
non-parametrics, we have to impose a bound on the complexity of the predictor.
Instead of estimating a solution $f^*_{L,P}$ of the (ill-posed) minimization
problem



$$\min_{f \in H} \int L(x, y, f(x)) \, P(d(x,y)) \qquad (2)$$

we estimate a smoother approximation, namely the solution $f_{L,P,\lambda_0}$
of the minimization problem

$$\min_{f \in H} \int L(x, y, f(x)) \, P(d(x,y)) + \lambda_0 \|f\|_H^2 \qquad (3)$$

for a fixed regularization parameter $\lambda_0 \in (0,\infty)$. The minimizer
$f_{L,P,\lambda_0}$ of (3) is called theoretical SVM. This so-called Tikhonov
regularization is equivalent to a minimization problem

$$\int L(x, y, f(x)) \, P(d(x,y)) = \min! \qquad f \in H, \ \|f\|_H \le r_0$$

where $r_0$ can be interpreted as an upper bound on the complexity of the
function $f$; a smaller $\lambda_0 > 0$ corresponds to a larger $r_0 > 0$. It
will be shown that the sequence of SVM-estimators

$$(\mathcal{X} \times \mathcal{Y})^n \to H, \qquad D_n \mapsto f_{L,D_n,\lambda_{D_n}}$$

is asymptotically normal for the rate $\sqrt{n}$ if the empirical SVM
$f_{L,D_n,\lambda_{D_n}}$ is shifted by the theoretical SVM $f_{L,P,\lambda_0}$.
That is,

$$\sqrt{n} \, (f_{L,D_n,\lambda_{D_n}} - f_{L,P,\lambda_0})$$

converges weakly to a (zero-mean) Gaussian process in the function space
$H$. This also implies asymptotic normality of the risk

$$\sqrt{n} \, \big( \mathcal{R}_{L,P}(f_{L,D_n,\lambda_{D_n}}) - \mathcal{R}_{L,P}(f_{L,P,\lambda_0}) \big) \rightsquigarrow \sigma \, \mathcal{N}(0,1)$$

where $\mathcal{R}_{L,P}(f) = \int L(x, y, f(x)) \, P(d(x,y))$ denotes the risk
of a predictor $f$ and $\sigma \in [0,\infty)$. The regularization parameter
$\lambda_{D_n}$ for the empirical SVM may depend on the data. We only need that
$\sqrt{n}(\lambda_{D_n} - \lambda_0)$ converges to 0 in probability. This will
be proven by an advanced application of a functional delta-method.
Accordingly, it will be shown that the map $P \mapsto f_{L,P,\lambda}$ is
suitably Hadamard-differentiable. According to (1) and (3), SVMs can be seen as
(regularized) M-estimators for a parameter in a typically infinite dimensional
Hilbert space. Asymptotic normality of M-estimators for finite-dimensional
parameters and rates of convergence of M-estimators for parameters in metric
spaces are considered in van de Geer (2000).

Of course, it would be desirable to dispense with the complexity bound and
to have asymptotic normality of

$$\sqrt{n} \, (f_{L,D_n,\lambda_{D_n}} - f^*_{L,P}) \qquad \text{instead of} \qquad \sqrt{n} \, (f_{L,D_n,\lambda_{D_n}} - f_{L,P,\lambda_0})$$

- if $f^*_{L,P}$ exists at all. However, in the non-parametric setting where
$H$ is a large infinite-dimensional function space, this is not possible. Such
a result would violate the no-free-lunch theorem which, roughly speaking,
yields that there is no uniform rate of convergence without such a bound on
the complexity. It is only possible to get uniform rates of convergence within
special classes of distributions. The investigation of rates of convergence
for special cases - e.g. classification under assumptions on the unknown true
probability measure such as Tsybakov's noise assumption (Tsybakov, 2004,
p. 138) - is one of the most important topics of recent research about support
vector machines and related learning methods; see e.g. Steinwart and Scovel
(2007), Caponnetto and De Vito (2007), Blanchard et al. (2008), Steinwart et
al. (2009), Mendelson and Neeman (2010). It is a matter of further research if
similar assumptions on the unknown true probability measures allow asymptotic
normality of $\sqrt{n}(f_{L,D_n,\lambda_{D_n}} - f^*_{L,P})$.

The article is organized as follows: Section 2 briefly recalls the definition
of support vector machines in a broad sense and fixes the notation. Section 3.1
contains the main results concerning asymptotic normality of support vector
machines and their risks. Since the proof is quite involved, it is deferred to
the appendix but Section 3.2 provides a short outline. Finally, Section 4
contains some concluding remarks.



2 Support Vector Machines 

Let $(\Omega, \mathcal{A}, Q)$ be a probability space, let $\mathcal{X}$ be a
closed and bounded subset of $\mathbb{R}^d$, and let $\mathcal{Y}$ be a closed
subset of $\mathbb{R}$ with Borel-$\sigma$-algebra $\mathfrak{B}(\mathcal{Y})$.
The Borel-$\sigma$-algebra of $\mathcal{X} \times \mathcal{Y}$ is denoted by
$\mathfrak{B}(\mathcal{X} \times \mathcal{Y})$. Let

$$X_1, \dots, X_n : (\Omega, \mathcal{A}, Q) \to (\mathcal{X}, \mathfrak{B}(\mathcal{X})),$$
$$Y_1, \dots, Y_n : (\Omega, \mathcal{A}, Q) \to (\mathcal{Y}, \mathfrak{B}(\mathcal{Y}))$$

be random variables such that $(X_1, Y_1), \dots, (X_n, Y_n)$ are independent
and identically distributed according to some unknown probability measure $P$
on $(\mathcal{X} \times \mathcal{Y}, \mathfrak{B}(\mathcal{X} \times \mathcal{Y}))$.
Define

$$D_n := ((X_1, Y_1), \dots, (X_n, Y_n)) \qquad \forall \, n \in \mathbb{N} .$$

A measurable map $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0,\infty)$
is called loss function. A loss function $L$ is called convex loss function if
it is convex in its third argument, i.e. $t \mapsto L(x, y, t)$ is convex for
every $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Furthermore, a loss function
$L$ is called P-integrable Nemitski loss function of order $p \in [1,\infty)$
if there is a P-integrable function $b : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$
such that

$$|L(x, y, t)| \le b(x, y) + |t|^p \qquad \forall \, (x, y, t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R} .$$

If $b$ is even P-square-integrable, $L$ is called P-square-integrable Nemitski
loss function of order $p \in [1,\infty)$. The risk of a measurable function
$f : \mathcal{X} \to \mathbb{R}$ is defined by

$$\mathcal{R}_{L,P}(f) = \int_{\mathcal{X} \times \mathcal{Y}} L(x, y, f(x)) \, P(d(x,y)) .$$

The goal is to estimate a function $f : \mathcal{X} \to \mathbb{R}$ which
minimizes this risk. The estimates obtained from the method of support vector
machines are elements of so-called reproducing kernel Hilbert spaces (RKHS)
$H$. An RKHS $H$ is a certain Hilbert space of functions
$f : \mathcal{X} \to \mathbb{R}$ which is generated by a kernel
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. See e.g. Schölkopf and
Smola (2002) or Steinwart and Christmann (2008) for details about these
concepts.

Let $H$ be such an RKHS. Then, the regularized risk of an element $f \in H$ is
defined to be

$$\mathcal{R}_{L,P,\lambda}(f) = \mathcal{R}_{L,P}(f) + \lambda \|f\|_H^2 , \qquad \text{where } \lambda \in (0,\infty) .$$

An element $f \in H$ is called a support vector machine and denoted by
$f_{L,P,\lambda}$ if it minimizes the regularized risk in $H$. That is,

$$\mathcal{R}_{L,P}(f_{L,P,\lambda}) + \lambda \|f_{L,P,\lambda}\|_H^2 = \inf_{f \in H} \mathcal{R}_{L,P}(f) + \lambda \|f\|_H^2 .$$

The SVM-estimator is defined by

$$S_n : (\mathcal{X} \times \mathcal{Y})^n \to H, \qquad D_n \mapsto f_{L,D_n,\lambda_{D_n}}$$

where $f_{L,D_n,\lambda_{D_n}}$ is that function $f \in H$ which minimizes

$$\frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, f(x_i)) + \lambda_{D_n} \|f\|_H^2 \qquad (4)$$

in $H$ for $D_n = ((x_1, y_1), \dots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$.
The empirical support vector machine $f_{L,D_n,\lambda_{D_n}}$ uniquely exists
for every $\lambda_{D_n} \in (0,\infty)$ and every data-set
$D_n \in (\mathcal{X} \times \mathcal{Y})^n$ if $t \mapsto L(x, y, t)$ is convex
for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$.
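
For the least-squares loss, the minimizer of (4) has a closed form by the
representer theorem, which makes the definition easy to try out numerically.
The following is only a minimal sketch under stated assumptions: a Gaussian RBF
kernel, the least-squares loss, and simulated data. The names
`gaussian_kernel`, `fit_ls_svm`, `regularized_empirical_risk` and the bandwidth
`gamma` are illustrative and not part of the article.

```python
import numpy as np


def gaussian_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)


def fit_ls_svm(x, y, lam, gamma=1.0):
    """Minimize (1/n) * sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2 over the RKHS H.

    By the representer theorem, f = sum_i alpha_i k(., x_i); for the
    least-squares loss, alpha solves (K + n * lam * I) alpha = y.
    """
    n = len(y)
    K = gaussian_kernel(x, x, gamma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return alpha, K


def regularized_empirical_risk(alpha, K, y, lam):
    """Objective (4) evaluated at f = sum_i alpha_i k(., x_i)."""
    residuals = y - K @ alpha
    return np.mean(residuals ** 2) + lam * alpha @ K @ alpha


# Simulated data (an assumption made only for this illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(50, 1))
y = np.sin(3.0 * x[:, 0]) + 0.1 * rng.standard_normal(50)
alpha, K = fit_ls_svm(x, y, lam=0.1)
print(regularized_empirical_risk(alpha, K, y, lam=0.1))
```

For non-quadratic convex losses such as the logistic loss, no closed form is
available and (4) would have to be minimized numerically over the coefficients
`alpha`.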

The symbol $\rightsquigarrow$ denotes weak convergence of probability measures
or random variables.



3 Asymptotic Normality 
3.1 Main Results 

The following theorems provide the main results. For random sequences
of regularization parameters $(\lambda_{D_n})_{n \in \mathbb{N}} \subset (0,\infty)$
which converge in probability with rate $\sqrt{n}$ to some
$\lambda_0 \in (0,\infty)$, Theorem 3.1 says that the $\sqrt{n}$-standardized
difference between the empirical support vector machine
$f_{L,D_n,\lambda_{D_n}}$ and the theoretical support vector machine
$f_{L,P,\lambda_0}$ is asymptotically normal under some relatively mild
conditions. That is, the $H$-valued random variable

$$\Omega \to H, \qquad \omega \mapsto \sqrt{n} \, (f_{L,D_n(\omega),\lambda_{D_n(\omega)}} - f_{L,P,\lambda_0})$$

converges weakly to a random variable

$$\mathbb{H} : \Omega \to H, \qquad \omega \mapsto \mathbb{H}(\omega)$$

which is a Gaussian process in $H$. Accordingly, for every finite collection of
functions $\{f_1, \dots, f_m\} \subset H$, the random variable

$$\Omega \to \mathbb{R}^m, \qquad \omega \mapsto \big( \langle f_1, \mathbb{H}(\omega) \rangle_H, \dots, \langle f_m, \mathbb{H}(\omega) \rangle_H \big)$$

has a multivariate normal distribution. In particular, the reproducing
property of $k$ implies that, for every $x_1, \dots, x_m \in \mathcal{X}$,

$$\sqrt{n} \begin{pmatrix} f_{L,D_n,\lambda_{D_n}}(x_1) - f_{L,P,\lambda_0}(x_1) \\ \vdots \\ f_{L,D_n,\lambda_{D_n}}(x_m) - f_{L,P,\lambda_0}(x_m) \end{pmatrix} \rightsquigarrow \mathcal{N}_m(0, \Sigma)$$

where $\Sigma$ is a covariance matrix. In addition, Theorem 3.2 provides
$\sqrt{n}$-consistency of the risk.

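Theorem 3.1 Let $\mathcal{X} \subset \mathbb{R}^d$ be closed and bounded and
let $\mathcal{Y} \subset \mathbb{R}$ be closed. Assume that
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the restriction of an
$m$-times continuously differentiable kernel
$\tilde{k} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ such that
$m > d/2$ and $k \neq 0$. Let $H$ be the RKHS of $k$ and let $P$ be a
probability measure on
$(\mathcal{X} \times \mathcal{Y}, \mathfrak{B}(\mathcal{X} \times \mathcal{Y}))$.
Let

$$L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0,\infty), \qquad (x, y, t) \mapsto L(x, y, t)$$

be a convex, P-square-integrable Nemitski loss function of order
$p \in [1,\infty)$ such that the partial derivatives

$$L'(x, y, t) := \frac{\partial L}{\partial t}(x, y, t) \qquad \text{and} \qquad L''(x, y, t) := \frac{\partial^2 L}{\partial t^2}(x, y, t)$$

exist for every $(x, y, t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R}$.
Assume that the maps

$$(x, y, t) \mapsto L'(x, y, t) \qquad \text{and} \qquad (x, y, t) \mapsto L''(x, y, t)$$

are continuous. Furthermore, assume that for every $a \in (0,\infty)$, there is
a $b'_a \in L_2(P)$ and a constant $b''_a \in [0,\infty)$ such that, for every
$(x, y) \in \mathcal{X} \times \mathcal{Y}$,

$$\sup_{t \in [-a,a]} |L'(x, y, t)| \le b'_a(x, y) \qquad \text{and} \qquad \sup_{t \in [-a,a]} |L''(x, y, t)| \le b''_a . \qquad (5)$$

Then, for every $\lambda_0 \in (0,\infty)$, there is a tight, Borel-measurable
Gaussian process

$$\mathbb{H} : \Omega \to H, \qquad \omega \mapsto \mathbb{H}(\omega)$$

such that

$$\sqrt{n} \, (f_{L,D_n,\lambda_{D_n}} - f_{L,P,\lambda_0}) \rightsquigarrow \mathbb{H} \qquad \text{in } H \qquad (6)$$

for every Borel-measurable sequence of random regularization parameters
$\lambda_{D_n}$ with

$$\sqrt{n} \, (\lambda_{D_n} - \lambda_0) \xrightarrow[n \to \infty]{} 0 \qquad \text{in probability} .$$

The Gaussian process $\mathbb{H}$ is zero-mean; i.e.,
$E \langle f, \mathbb{H} \rangle_H = 0$ for every $f \in H$.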
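
The statement of Theorem 3.1 can be illustrated numerically in a deliberately
simple special case. The sketch below assumes $\mathcal{X} = [-1, 1]$, the
linear kernel $k(x, x') = x x'$ (so that $H = \{x \mapsto w x\}$ with
$\|f\|_H = |w|$), the least-squares loss and a fixed $\lambda_{D_n} = \lambda_0$.
In this case both SVMs have closed forms, $w_0 = E[XY] / (E[X^2] + \lambda_0)$
and $w_n = \overline{x y} / (\overline{x^2} + \lambda_0)$. The data-generating
model, the constants and all variable names are assumptions chosen only for
illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
lam0 = 0.1
reps, n = 2000, 400

# Assumed model: X ~ Uniform(-1, 1), Y = 2 X + standard normal noise.
# Then E[X^2] = 1/3 and E[XY] = 2/3, so the theoretical SVM is x -> w0 * x.
w0 = (2.0 / 3.0) / (1.0 / 3.0 + lam0)

x = rng.uniform(-1.0, 1.0, size=(reps, n))
y = 2.0 * x + rng.standard_normal((reps, n))

# Empirical SVM coefficient for each of the `reps` simulated data sets.
w_n = (x * y).mean(axis=1) / ((x ** 2).mean(axis=1) + lam0)

# sqrt(n)-standardized differences; Theorem 3.1 suggests these are
# approximately zero-mean normal for large n.
z = np.sqrt(n) * (w_n - w0)
print("mean:", z.mean(), "std:", z.std(ddof=1))
```

A histogram or normal quantile plot of `z` would be the natural next check;
this finite-dimensional example does not, of course, exercise the
infinite-dimensional part of the theorem.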

By use of this theorem, the following asymptotic result on the risks is
obtained.

Theorem 3.2 Under the assumptions of Theorem 3.1, there is, for every
$\lambda_0 \in (0,\infty)$, a constant $\sigma \in [0,\infty)$ such that

$$\sqrt{n} \, \big( \mathcal{R}_{L,P}(f_{L,D_n,\lambda_{D_n}}) - \mathcal{R}_{L,P}(f_{L,P,\lambda_0}) \big) \rightsquigarrow \sigma \, \mathcal{N}(0,1)$$

for every Borel-measurable sequence of random regularization parameters
$\lambda_{D_n}$ with $\sqrt{n}(\lambda_{D_n} - \lambda_0) \to 0$ in probability.

According to the above theorems, the Gaussian process $\mathbb{H}$ and the
constant $\sigma$ do not depend on the sequence $\lambda_{D_n}$,
$n \in \mathbb{N}$, but only on $\lambda_0$. Though it is possible that
$\mathbb{H}$ degenerates to 0, this only happens in trivial cases, e.g., if $P$
is equal to a Dirac distribution, or $|Y| \le \varepsilon$ while using a
smoothed version of the epsilon-insensitive loss; see Remark 3.6. If the
constant $\sigma$ is equal to 0 in Theorem 3.2, the limit degenerates to 0. In
contrast to $\mathbb{H}$, this does not only happen in degenerate cases. For
example, it is known that the rate of convergence of the risk is faster than
$\sqrt{n}$ in some cases (see e.g. Steinwart and Scovel (2007)) which leads to
a degenerate limit in Theorem 3.2.


As stated above, the results are true under some relatively mild assumptions. 
In particular, the assumptions on k are fulfilled for all of the most common 
kernels (e.g. Gaussian RBF kernel, polynomial kernel, exponential kernel, 
linear kernel). It is assumed that the loss function is two times continuously 
differentiable in the third argument. On the one hand, this is an obvious 
restriction because some of the most common loss functions are not differ- 
entiable: the epsilon-insensitive loss for regression and the hinge loss for 
classification. On the other hand, this assumption is not based on any un- 
known entity such as the model distribution P . In particular, a practitioner 
can a priori meet this requirement by a suitable choice of the loss function; 
e.g. the least-squares loss for regression and the logistic loss for classifica- 
tion. This is contrary to the noise assumptions common in order to establish 
rates of convergence to the Bayes risk because such assumptions depend on 
the unknown $P$ so that they can hardly be checked in applications. In
addition, Remark 3.5 describes how a Lipschitz-continuous loss function (such
as the epsilon-insensitive loss and the hinge loss) can always be turned into
a differentiable $\varepsilon$-version of the loss function. That is, though the theorem
does not cover support vector machines in the original terminology, it covers 
variants based on a slightly smoothed hinge loss. 

In order to ensure mere existence of the theoretical SVM $f_{L,P,\lambda_0}$,
it is necessary to assume a P-integrability condition. For example, it is
common to assume that $L$ is a P-integrable Nemitski loss function (Christmann
and Steinwart, 2007). In order to obtain asymptotic normality in the above
theorems, we assume that $L$ is a P-square-integrable Nemitski loss function
which seems to be a natural assumption in view of the square-integrability
assumptions for usual central limit theorems. In addition, a similar
P-integrability condition is assumed for the derivative of the loss function.
If $\mathcal{Y}$ is bounded (as, e.g., in case of a classification problem) and
$L$, $L'$ and $L''$ are continuous, all of the integrability assumptions are
fulfilled.

In order to fulfill

$$\sqrt{n} \, (\lambda_{D_n} - \lambda_0) \to 0 \qquad \text{in probability}$$

(which is the only assumption on the random sequence of regularization
parameters), it is possible to use any data-driven method for choosing the
regularization parameter. The only thing one has to do is to choose a
(possibly large) constant $c \in (0,\infty)$ and to make sure that the method
(e.g. cross validation) picks a value from
$[\lambda_0, \lambda_0 + c / (\sqrt{n} \ln(n))]$. Note that, as the notation
suggests, it is indeed possible to use the same data for choosing the
regularization parameter as for building the final SVM - just as usually
done by practitioners, e.g., when applying cross validation.
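
One simple way to enforce this restriction is to project whatever value the
selection method proposes onto the admissible interval. The following sketch
assumes that such a candidate value is already available; the function name
`clip_lambda` and the numerical values are hypothetical and serve only as an
illustration.

```python
import math


def clip_lambda(lam_candidate, lam0, c, n):
    """Project a data-driven candidate onto [lam0, lam0 + c / (sqrt(n) * ln(n))]."""
    if n < 2:
        raise ValueError("n must be at least 2 so that ln(n) > 0")
    upper = lam0 + c / (math.sqrt(n) * math.log(n))
    return min(max(lam_candidate, lam0), upper)


# Whatever value the selection method proposes (here: a hypothetical 0.5),
# sqrt(n) * (lambda_n - lam0) is bounded by c / ln(n), which tends to 0.
lam0, c = 0.1, 10.0
for n in (10, 1_000, 100_000):
    lam_n = clip_lambda(0.5, lam0, c, n)
    print(n, lam_n, math.sqrt(n) * (lam_n - lam0))
```

In practice one would restrict the cross-validation grid to this interval
rather than clip afterwards; both variants satisfy the assumption on
$\lambda_{D_n}$.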

The following examples list some general situations in which Theorems 3.1
and 3.2 are applicable.

Example 3.3 (Classification) Theorems 3.1 and 3.2 are applicable in the
following setting for a classification problem:

• $\mathcal{X}$ bounded and closed, $\mathcal{Y} = \{-1, 1\}$

• $k$ a Gaussian RBF kernel, a polynomial kernel, an exponential kernel
or a linear kernel

• $L$ the least-squares loss or the logistic loss

Example 3.4 (Regression) Theorems 3.1 and 3.2 are applicable in the
following setting for a regression problem:

• $\mathcal{X}$ bounded and closed, $\mathcal{Y}$ closed

• $k$ a Gaussian RBF kernel, a polynomial kernel, an exponential kernel
or a linear kernel

• $L$ the least-squares loss

• $P$ such that $\int y^4 \, P(d(x,y)) < \infty$

The following Remark 3.5 describes how a Lipschitz-continuous loss function
can always be turned into a differentiable $\varepsilon$-version of the loss
function such that all of the assumptions on the partial derivatives $L'$ and
$L''$ are automatically fulfilled. In particular, the proposed construction
works for the epsilon-insensitive loss and the hinge loss.

Remark 3.5 (Smoothing loss functions by use of mollifiers) Let
$L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0,\infty)$ be a
convex P-square-integrable Nemitski loss function of order $p \in [1,\infty)$.
Assume that $L$ is also a Lipschitz-continuous loss function. That is, there is
a constant $b' \in (0,\infty)$ such that

$$\sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} |L(x, y, t_1) - L(x, y, t_2)| \le b' |t_1 - t_2| \qquad \forall \, t_1, t_2 \in \mathbb{R} .$$

Then, for every $\varepsilon > 0$, it is possible to construct a loss function
$L_\varepsilon$ such that

$$|L(x, y, t) - L_\varepsilon(x, y, t)| \le \varepsilon \qquad \forall \, (x, y, t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \qquad (7)$$

and all of the assumptions of Theorems 3.1 and 3.2 are fulfilled for
$L_\varepsilon$.
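This can be done in the following way: Take a so-called mollifier function
$\varphi : \mathbb{R} \to \mathbb{R}$; e.g.,

$$\varphi : \mathbb{R} \to \mathbb{R}, \qquad t \mapsto \gamma \, e^{-1/(1 - t^2)} \, I_{(-1,1)}(t)$$

where $\gamma \in (0,\infty)$ is chosen so that $\int \varphi \, d\lambda = 1$
(here, $\lambda$ denotes the Lebesgue measure). (See e.g. Denkowski et al.
(2003, p. 341ff) for the concept of mollifiers and their basic properties.)
Define $\varphi_\varepsilon(s) = \varphi(s b' / \varepsilon)$ for every
$s \in \mathbb{R}$ and

$$L_\varepsilon(x, y, t) = \frac{b'}{\varepsilon} \int \varphi_\varepsilon(s) \, L(x, y, t - s) \, \lambda(ds) \qquad \forall \, (x, y, t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R} . \qquad (8)$$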



Then, (7) follows from an easy calculation using Lipschitz-continuity of $L$.
The $\varepsilon$-version $L_\varepsilon$ is again a convex P-square-integrable
Nemitski loss function of order $p \in [1,\infty)$. For every
$(x, y) \in \mathcal{X} \times \mathcal{Y}$, the function
$t \mapsto L_\varepsilon(x, y, t)$ is infinitely differentiable and the
derivatives are given by

$$\frac{\partial^m}{\partial t^m} L_\varepsilon(x, y, t) = \frac{b'}{\varepsilon} \int \frac{d^m \varphi_\varepsilon}{d s^m}(s) \, L(x, y, t - s) \, \lambda(ds) . \qquad (9)$$

Furthermore, for every $(x, y, t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R}$,

$$|L'_\varepsilon(x, y, t)| = \Big| \frac{\partial}{\partial t} L_\varepsilon(x, y, t) \Big| \le b' , \qquad (10)$$

$$|L''_\varepsilon(x, y, t)| = \Big| \frac{\partial^2}{\partial t^2} L_\varepsilon(x, y, t) \Big| \le b' \cdot \frac{b'}{\varepsilon} \int \Big| \frac{d \varphi_\varepsilon}{d s}(s) \Big| \, \lambda(ds) =: b'' . \qquad (11)$$

Inequality (10) follows from the definition of derivatives by means of
difference quotients, and Lipschitz-continuity of $L$. Inequality (11) follows
from the definition of derivatives by means of difference quotients, (9) for
$m = 1$, and Lipschitz-continuity of $L$.

In particular, the construction of such an $\varepsilon$-version of $L$ works
for the hinge loss (classification) and, if $\int y^2 \, P(d(x,y)) < \infty$,
for the epsilon-insensitive loss (regression). Another approach in order to
obtain smooth approximations of loss functions is proposed in Dekel et al.
(2005).
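
The construction of Remark 3.5 can be checked numerically for the hinge loss,
which is Lipschitz-continuous with $b' = 1$ on $\mathcal{Y} = \{-1, 1\}$. The
sketch below approximates the convolution (8) by a Riemann sum and normalizes
the mollifier numerically instead of computing $\gamma$; the bump function
$e^{-1/(1-u^2)}$, the grid sizes and all function names are assumptions made
for this illustration only.

```python
import numpy as np


def mollifier(u):
    """Unnormalized bump exp(-1 / (1 - u^2)) on (-1, 1), zero elsewhere."""
    out = np.zeros_like(u, dtype=float)
    inside = np.abs(u) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - u[inside] ** 2))
    return out


def hinge(y, t):
    """Hinge loss max(0, 1 - y t); Lipschitz in t with constant 1 for y in {-1, 1}."""
    return np.maximum(0.0, 1.0 - y * t)


def smoothed_hinge(y, t, eps, b_prime=1.0, num=4001):
    """Approximate the eps-version of the hinge loss by a Riemann sum.

    The mollifier is rescaled to have support (-eps / b', eps / b') and
    normalized numerically so that the quadrature weights sum to one.
    """
    s = np.linspace(-eps / b_prime, eps / b_prime, num)
    weights = mollifier(s * b_prime / eps)
    weights /= weights.sum()
    return float(np.sum(weights * hinge(y, t - s)))


eps = 0.1
grid = np.linspace(-3.0, 3.0, 121)
max_gap = max(abs(hinge(y, t) - smoothed_hinge(y, t, eps))
              for y in (-1.0, 1.0) for t in grid)
print(max_gap)  # stays below eps, in line with (7)
```

Away from the kink at $y t = 1$ the smoothed and the original hinge loss
coincide, so the approximation only changes the loss in a neighbourhood of
width $2 \varepsilon / b'$.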

The following Remark 3.6 shows that the limit distribution in Theorem 3.1
is only degenerate in trivial cases.

Remark 3.6 (Degenerated limit distribution) As shown in Proposition 5.11
in the appendix, the Gaussian process $\mathbb{H}$ in

$$\sqrt{n} \, (f_{L,D_n,\lambda_{D_n}} - f_{L,P,\lambda_0}) \rightsquigarrow \mathbb{H}$$

(Theorem 3.1) is degenerated to 0 if and only if, for every $h \in H$, there is
a constant $c_h \in \mathbb{R}$ such that

$$L'(x, y, f_{L,P,\lambda_0}(x)) \, h(x) = c_h \qquad \text{for } P\text{-a.e. } (x, y) \in \mathcal{X} \times \mathcal{Y} . \qquad (12)$$

This only happens in trivial cases in which statistical evaluations are
superfluous. Typically, (12) means that

$$L'(x, y, f_{L,P,\lambda_0}(x)) = 0 \qquad \text{for } P\text{-a.e. } (x, y) \in \mathcal{X} \times \mathcal{Y} \qquad (13)$$

and, therefore, the representer theorem (Steinwart and Christmann, 2008,
Theorem 5.9) implies $f_{L,P,\lambda_0}(x) = 0$ almost surely so that (13)
implies

$$L'(x, y, 0) = 0 \qquad \text{for } P\text{-a.e. } (x, y) \in \mathcal{X} \times \mathcal{Y} . \qquad (14)$$

For example, (12) implies (13) and (14) if $H$ is an RKHS which contains
constants and at least one function which is not almost surely constant, or
if $k$ is a universal kernel (as in case of the Gaussian kernel) and $X_i$ is
not almost surely a constant.

Finally, let us summarize the implications of (13) and (14) in case of
different loss functions. Classification with $Y_i \in \{-1, 1\}$: In case of
the logistic loss, the squared loss and a slightly smoothed hinge loss, (14) is
impossible. Regression: In case of the Huber loss and the squared loss, (14)
implies that $Y_i = 0$ almost surely. In case of a slightly smoothed
$\varepsilon$-insensitive loss, (14) implies
$Y_i \in [-\varepsilon, \varepsilon]$ almost surely.

3.2 Supplements and Sketch of the Proof

The proof of Theorems 3.1 and 3.2 is an involved application of the functional
delta-method. In order to describe this in some more detail, let us first fix a
constant sequence of regularization parameters. That is,
$\lambda_{D_n} = \lambda_0 \in (0,\infty)$ for every $n \in \mathbb{N}$. Then,
support vector machines may be represented by a functional $S$ on a set of
probability measures on
$(\mathcal{X} \times \mathcal{Y}, \mathfrak{B}(\mathcal{X} \times \mathcal{Y}))$.
This functional

$$S : P \mapsto f_{L,P,\lambda_0}$$

is called SVM-functional in the following. It represents the SVM-estimator
because the empirical support vector machine is equal to
$f_{L,D_n,\lambda_0} = S(\mathbb{P}_{D_n})$ for every data set
$D_n \in (\mathcal{X} \times \mathcal{Y})^n$ where $\mathbb{P}_{D_n}$ denotes the
empirical measure corresponding to $D_n$. In order to use the functional
delta-method, it is crucial that this is true for every sample size $n$ and
that $S$ does not depend on $n$. (In Remark 3.7, it will be explained how it is
nevertheless possible to deal with random sequences $\lambda_{D_n}$.) Theorem
3.1 can be shown in the following way:

1. Show that $\sqrt{n}(\mathbb{P}_{D_n} - P)$ converges weakly to a Gaussian process.

2. Show that $S$ is Hadamard-differentiable:

(a) Show that $S$ is Gâteaux-differentiable.

(b) Show that the Gâteaux-derivative fulfills a continuity property.

(c) Show that (a) and (b) imply Hadamard-differentiability.

3. Then, it follows from the functional delta-method that

$$\sqrt{n} \, (f_{L,\mathbb{P}_{D_n},\lambda_0} - f_{L,P,\lambda_0}) = \sqrt{n} \, \big( S(\mathbb{P}_{D_n}) - S(P) \big)$$

converges weakly to a Gaussian process.

Theorem 3.2 follows from Theorem 3.1 by another application of the functional
delta-method.

Step 1 involves the study of Donsker classes. Among other things, this
is based on a bound (62) on the uniform entropy number of balls in the
reproducing kernel Hilbert space $H$. A proof of this bound is given in
the proof of Lemma 5.9. In similar settings, such bounds have already
been proven, e.g. in (Zhou, 2003, § V) and (Steinwart and Christmann, 2008,
§ 6.4). In general, $\sqrt{n}(\mathbb{P}_{D_n} - P)$ is not a measurable random
variable so that the proof involves the theory of weak convergence of
unmeasurable random variables; see van der Vaart and Wellner (1996). However,
this does not affect the statements of Theorems 3.1 and 3.2 because
$\omega \mapsto \sqrt{n}(f_{L,D_n(\omega),\lambda_{D_n(\omega)}} - f_{L,P,\lambda_0})$
is a measurable random variable as shown in the beginning of the proof of
Theorem 3.1 in Subsection 5.4.

Essentially, it has already been known that $S$ is Gâteaux-differentiable
because Christmann and Steinwart (2004, 2007) derive the influence function
of $S$ which is a (special) Gâteaux-derivative. Therefore, essential steps of
the proof of Step 2(a) can be adopted from Christmann and Steinwart (2004,
2007) and (Steinwart and Christmann, 2008, § 10.4) but some care is needed
as we also have to deal with signed measures here. In addition, we also have
to deal with a sequence of random regularization parameters $\lambda_{D_n}$
instead of a fixed $\lambda_0$; see Remark 3.7. In Step 2(c) it will be shown
that $S$ is even Hadamard-differentiable (in a specific sense described in
Subsection 5.3). This is done because the application of the delta-method
requires Hadamard-differentiability. However, this might also be useful for
other purposes since, e.g., the chain rule is valid for
Hadamard-differentiability but not for Gâteaux-differentiability. Christmann
and Van Messem (2008) show Bouligand-differentiability of the SVM-functional
which also allows the chain rule.

Remark 3.7 (Sequences of random regularization parameters $\lambda_{D_n}$)
For a fixed regularization parameter $\lambda_0$, support vector machines can
be represented by a functional $S : P \mapsto f_{L,P,\lambda_0}$ and the
delta-method can be applied for $S$. However, if we have a sequence of (random)
regularization parameters $\lambda_{D_n}$, we get a (random) sequence of
functionals

$$S_{D_n} : P \mapsto f_{L,P,\lambda_{D_n}}$$

for which the delta-method cannot be applied offhand. This problem can be
solved in the following way: As described in Subsection 5.1,

$$S_{D_n}(P) = f_{L,P,\lambda_{D_n}} = f_{L,\frac{\lambda_0}{\lambda_{D_n}} P,\lambda_0} = S\Big( \frac{\lambda_0}{\lambda_{D_n}} P \Big) \qquad \forall \, P$$

so that everything can be traced back to $S$. In this way, the explicit use
of $S_{D_n}$ can be avoided and the delta-method turns out to be applicable also
in this case. The price we have to pay is that we have to deal with general
finite measures in the proofs because, in general,
$\frac{\lambda_0}{\lambda_{D_n}} P$ is not a probability measure any more.



4 Conclusions 

In the article, asymptotic properties of support vector machines are
investigated. For sequences of random regularization parameters
$\lambda_{D_n}$, $n \in \mathbb{N}$, such that
$\sqrt{n}(\lambda_{D_n} - \lambda_0) \to 0$ in probability, it is shown that the
difference between the empirical and the theoretical SVM is asymptotically
normal with rate $\sqrt{n}$; that is,
$\sqrt{n}(f_{L,D_n,\lambda_{D_n}} - f_{L,P,\lambda_0})$ converges to a Gaussian
process in the function space $H$. The value $\lambda_0 > 0$ corresponds to a
bound on the complexity of the estimate for the regression function; a smaller
$\lambda_0$ allows for more complex functions. Therefore, the theoretical SVM
$f_{L,P,\lambda_0}$ serves as a "smoother" approximation of more complex
regression functions. The results of this article show that, in nonparametric
classification and regression problems, the estimation of this smoother
approximation by use of empirical SVMs in an infinite dimensional function
space is asymptotically normal with rate $\sqrt{n}$ - just as if it was a
parametric problem. The proof is done by showing that the map
$P \mapsto f_{L,P,\lambda}$ is suitably Hadamard-differentiable and by an
application of a functional delta-method.

Estimating a smoother approximation of the regression function is a
compromise between a parametric model and a fully non-parametric model without
any assumptions on the regression function or the distribution. Without any
such assumptions, similar results are not possible as follows from the
no-free-lunch theorem.

5 Appendix: Proof of the Main Results

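The assumptions of Theorem 3.1 are valid in the whole appendix.

5.1 Preparations

The map $\Phi : \mathcal{X} \to H$ always denotes the canonical feature map
corresponding to the kernel $k$ and the RKHS $H$. It will frequently be used in
the proofs that the reproducing property implies

$$\langle \Phi(x), f \rangle_H = f(x) \qquad \forall \, x \in \mathcal{X}, \ \forall \, f \in H \qquad (15)$$

or, in shorter notation,

$$\langle \Phi, f \rangle_H = f \qquad \forall \, f \in H . \qquad (16)$$

In particular, we have

$$E_\mu \langle \Phi, f \rangle_H = \int \langle \Phi, f \rangle_H \, d\mu = \int \langle \Phi(x), f \rangle_H \, \mu(dx) = \int f(x) \, \mu(dx) . \qquad (17)$$

According to (Steinwart and Christmann, 2008, p. 124), boundedness of $k$
implies:

$$\|k\|_\infty := \sup_{x \in \mathcal{X}} \sqrt{k(x, x)} = \sup_{x \in \mathcal{X}} \|\Phi(x)\|_H < \infty \qquad (18)$$

$$\|f\|_\infty \le \|k\|_\infty \, \|f\|_H \qquad \forall \, f \in H . \qquad (19)$$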
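
Inequality (19) is easy to check numerically for functions of the form
$f = \sum_i \alpha_i k(\cdot, x_i)$, for which the reproducing property gives
$\|f\|_H^2 = \alpha^T K \alpha$. The following sketch assumes a Gaussian kernel
on $\mathcal{X} = [-1, 1]$ (so that $\sup_x \sqrt{k(x,x)} = 1$) and randomly
drawn centers and coefficients; the sup-norm is only approximated on a grid,
and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)


def gaussian_kernel(a, b, gamma=2.0):
    """Gaussian RBF kernel on the real line; k(x, x) = 1 for every x."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)


# A hypothetical element f = sum_i alpha_i k(., x_i) of the RKHS H.
centers = rng.uniform(-1.0, 1.0, size=15)
alpha = rng.standard_normal(15)

# ||f||_H^2 = alpha^T K alpha by the reproducing property.
K = gaussian_kernel(centers, centers)
rkhs_norm = np.sqrt(alpha @ K @ alpha)

# Approximate sup-norm of f on a fine grid of X = [-1, 1].
grid = np.linspace(-1.0, 1.0, 2001)
sup_norm = np.max(np.abs(gaussian_kernel(grid, centers) @ alpha))

k_sup = 1.0  # sup_x sqrt(k(x, x)) for the Gaussian kernel
print(sup_norm, k_sup * rkhs_norm)
```

The first printed value should not exceed the second, in line with (19).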



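In order to shorten notation, define

$$L_f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}, \qquad (x, y) \mapsto L_f(x, y) = L(x, y, f(x))$$

for every function $f : \mathcal{X} \to \mathbb{R}$. Accordingly, define

$$L'_f(x, y) = L'(x, y, f(x)) \qquad \text{and} \qquad L''_f(x, y) = L''(x, y, f(x))$$

for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$. As $L$ is a
P-square-integrable Nemitski loss function of order $p \in [1,\infty)$, there
is a $b \in L_2(P)$ such that

$$|L(x, y, t)| \le b(x, y) + |t|^p \qquad \forall \, (x, y, t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R} . \qquad (20)$$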

Let

$$\mathcal{G}_1 := \{ g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R} \mid \exists \, z \in \mathbb{R}^{d+1} \text{ such that } g = I_{(-\infty, z]} \}$$

be the set of all indicator functions $I_{(-\infty, z]}$. Then, it is
well-known that

$$\sqrt{n} \, (\mathbb{F}_n - F) \rightsquigarrow \mathbb{G}_1 \qquad \text{in } \ell_\infty(\mathcal{G}_1)$$

where $\mathbb{F}_n$ denotes the empirical process, $F$ denotes the
distribution function of $P$, $\mathbb{G}_1$ is a Gaussian process, and
$\ell_\infty(\mathcal{G}_1)$ denotes the set of all bounded functions
$G : \mathcal{G}_1 \to \mathbb{R}$. Provided that the SVM-functional $S$ is
Hadamard-differentiable in $\ell_\infty(\mathcal{G}_1)$, an application of the
functional delta-method would yield asymptotic normality of
$\sqrt{n}(S(\mathbb{F}_n) - S(F))$. Unfortunately, the norm-topology of
$\ell_\infty(\mathcal{G}_1)$ is too weak in order to ensure
Hadamard-differentiability. Therefore, the set of indicator functions
$\mathcal{G}_1$ has to be enlarged to a set $\mathcal{G} \supset \mathcal{G}_1$
which leads to the following somewhat technical definition of the domain $B_S$
of the SVM-functional $S$. Define



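$$c_0 := \sqrt{\frac{1}{\lambda_0} \int b \, dP} + 1 ,$$

$$\mathcal{G}_2 := \Big\{ g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R} \ \Big| \ \exists \, f_0 \in H, \ \exists \, f \in H \text{ such that } \|f_0\|_H \le c_0 , \ \|f\|_H \le 1 \text{ and } g = L'_{f_0} f \Big\} \qquad (21)$$

and

$$\mathcal{G} := \mathcal{G}_1 \cup \mathcal{G}_2 \cup \{b\} .$$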

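Let $\ell_\infty(\mathcal{G})$ be the set of all bounded functions

$$F : \mathcal{G} \to \mathbb{R}$$

with norm $\|F\|_\infty = \sup_{g \in \mathcal{G}} |F(g)|$. Define

$$B_S := \Big\{ F : \mathcal{G} \to \mathbb{R} \ \Big| \ \exists \, \mu \neq 0 \text{ a finite measure on } \mathcal{X} \times \mathcal{Y} \text{ such that } F(g) = \int g \, d\mu \ \ \forall \, g \in \mathcal{G}, \ \ b \in L_2(\mu), \ \ b'_a \in L_2(\mu) \ \ \forall \, a \in (0,\infty) \Big\}$$

and $B_0 := \mathrm{cl}(\mathrm{lin}(B_S))$ the closed linear span of $B_S$ in
$\ell_\infty(\mathcal{G})$. That is, $B_S$ is a subset of
$\ell_\infty(\mathcal{G})$ whose elements correspond to finite measures. The
elements of $B_S$ can be seen as some kind of generalized distribution
functions. Note that the assumptions on $L$ and $P$ imply that
$\mathcal{G} \to \mathbb{R}$, $g \mapsto \int g \, dP$ is a well-defined element
of $B_S$.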

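For every $F \in B_S$, let $\iota(F)$ denote the corresponding finite measure
$\mu$ on
$(\mathcal{X} \times \mathcal{Y}, \mathfrak{B}(\mathcal{X} \times \mathcal{Y}))$
such that

$$F(g) = \int g \, d\mu \qquad \forall \, g \in \mathcal{G} .$$

Note that, by definition of $B_S$, $\iota(F)$ uniquely exists for every
$F \in B_S$ so that

$$\iota : B_S \to \mathrm{ca}^+(\mathcal{X} \times \mathcal{Y}, \mathfrak{B}(\mathcal{X} \times \mathcal{Y})), \qquad F \mapsto \iota(F)$$

is well-defined where
$\mathrm{ca}^+(\mathcal{X} \times \mathcal{Y}, \mathfrak{B}(\mathcal{X} \times \mathcal{Y}))$
denotes the set of all finite measures on
$(\mathcal{X} \times \mathcal{Y}, \mathfrak{B}(\mathcal{X} \times \mathcal{Y}))$.
The set of all finite signed measures on
$(\mathcal{X} \times \mathcal{Y}, \mathfrak{B}(\mathcal{X} \times \mathcal{Y}))$
is denoted by
$\mathrm{ca}(\mathcal{X} \times \mathcal{Y}, \mathfrak{B}(\mathcal{X} \times \mathcal{Y}))$.
The set of all continuous functions $f : \mathcal{X} \to \mathbb{R}$ is denoted
by $C(\mathcal{X})$. Since $\mathcal{X}$ is compact by assumption, the elements
of $C(\mathcal{X})$ are bounded and $C(\mathcal{X})$ is endowed with the
sup-norm $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$.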

So far, support vector machines have only been defined for probability measures
$P$. However, in order to deal with sequences of random regularization
parameters $\lambda_{D_n}$, we will also have to deal with "support vector machines" for
general finite measures $\mu$. For every $F \in B_S$, define

$$f_{L,\iota(F),\lambda} := \arg\inf_{f\in H} \int L(x,y,f(x))\,\iota(F)(d(x,y)) + \lambda\|f\|_H^2\,.$$

Though $\mu := \iota(F) \in \mathrm{ca}^+(\mathcal{X}\times\mathcal{Y}, \mathfrak{B}(\mathcal{X}\times\mathcal{Y}))$ is not necessarily a probability
measure, we have, in effect, not defined any new object. In order to see this,
note that dividing the objective function by $M := \mu(\mathcal{X}\times\mathcal{Y})$ does not change
the minimizer so that

$$f_{L,\mu,\lambda} = \arg\inf_{f\in H} \int L(x,y,f(x))\,\tfrac{1}{M}\mu(d(x,y)) + \tfrac{\lambda}{M}\|f\|_H^2$$

is an "ordinary" support vector machine as $\frac{1}{M}\mu$ is a probability measure.
This also shows that $f_{L,\mu,\lambda}$ uniquely exists because $f_{L,\frac{1}{M}\mu,\frac{\lambda}{M}}$
uniquely exists for the probability measure $\frac{1}{M}\mu$ according to (Steinwart and Christmann,
2008, Lemma 5.1 and Theorem 5.2).
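The rescaling argument can be checked numerically in a toy setting. The following sketch is only an illustration under assumptions that are not part of the paper: the loss is the least-squares loss $L(x,y,t) = (y-t)^2$, the RKHS consists of the constant functions (kernel $k \equiv 1$, so that $\|f\|_H = |f|$), and the measure is a finite sum of point masses. In this case the minimizer has the closed form $\sum_i w_i y_i / (\sum_i w_i + \lambda)$; the names `svm_constant`, `weights` and `ys` are hypothetical.

```python
def svm_constant(weights, ys, lam):
    """Minimizer of sum_i w_i*(y_i - f)^2 + lam*f^2 over constants f.

    Setting the derivative to zero gives f = sum(w*y) / (sum(w) + lam).
    """
    num = sum(w * y for w, y in zip(weights, ys))
    return num / (sum(weights) + lam)


# A finite (non-probability) measure mu with point masses w_i at responses y_i.
weights = [0.5, 2.0, 1.5]
ys = [1.0, -1.0, 3.0]
lam = 0.7

M = sum(weights)                       # total mass mu(X x Y)
f_mu = svm_constant(weights, ys, lam)  # "SVM" for the finite measure mu

# Same problem after dividing the objective by M: probability measure mu/M
# together with the regularization parameter lam/M.
f_prob = svm_constant([w / M for w in weights], ys, lam / M)

print(f_mu, f_prob)  # the two values agree up to rounding
```

In this toy case both calls return the same minimizer, which mirrors the statement that passing from $\mu$ to $\frac{1}{M}\mu$ and from $\lambda$ to $\frac{\lambda}{M}$ does not define a new object.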

The idea is that considering support vector machines for general finite
measures makes it possible to take $\lambda_0$ as a "standard regularization
parameter". Define

$$S : B_S \to H, \qquad F \mapsto S(F) = f_{\iota(F)}$$

where

$$f_{\iota(F)} := f_{L,\iota(F),\lambda_0} = \arg\inf_{f\in H} \int L(x,y,f(x))\,\iota(F)(d(x,y)) + \lambda_0\|f\|_H^2\,.$$

Then, we can deal with other regularization parameters $\lambda > 0$ by use of

$$f_{L,\iota(F),\lambda} = S\big(\tfrac{\lambda_0}{\lambda} F\big) \qquad \forall\, F \in B_S\,. \tag{22}$$

This is important in order to apply the functional delta-method in case of
a sequence of random regularization parameters $\lambda_{D_n}$; see also Remark 3.7.


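It follows from (Steinwart and Christmann, 2008, Eqn. (5.4) and Lemma
4.23) that

$$\|f_{\iota(F)}\|_H \le \sqrt{\tfrac{1}{\lambda_0} F(b)} \qquad \forall\, F \in B_S\,, \tag{23}$$

$$\|f_{\iota(F)}\|_\infty \le \|k\|_\infty \sqrt{\tfrac{1}{\lambda_0} F(b)} \qquad \forall\, F \in B_S\,. \tag{24}$$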



Since $\mathcal{X}$ is separable and $k$ is a continuous kernel, the RKHS $H$ is a separable
Hilbert space; see (Steinwart and Christmann, 2008, Lemma 4.33).
Separability of $H$ is used several times in the proofs; this is important particularly
with regard to the Bochner-integral of $H$-valued functions $\Psi : \mathcal{Z} \to H$. The
Bochner-integral $\int \Psi\,d\mu = \int \Psi\,d\mu^+ - \int \Psi\,d\mu^-$ of such an $H$-valued function $\Psi$
with respect to a finite signed measure $\mu = \mu^+ - \mu^-$ is again an element of $H$.

If $\Psi$ is suitably measurable, then existence of the Bochner-integral follows
from $\int \|\Psi\|_H\,d|\mu| < \infty$ where $|\mu| = \mu^+ + \mu^-$ denotes the total variation of $\mu$.
We will also frequently use the fact that, for every Banach space $E$ and every
continuous linear operator $A : H \to E$, the existence of the Bochner-integral
$\int \Psi\,d\mu$ implies the existence of the Bochner-integral $\int A(\Psi)\,d\mu$ and

$$\int A(\Psi)\,d\mu = A\Big(\int \Psi\,d\mu\Big)\,; \tag{25}$$

see, e.g. (Denkowski et al., 2003, Theorem 3.10.16 and Remark 3.10.17).
This subsection closes with three lemmas which are used several times.
Thereafter, Gateaux-differentiability of the SVM-functional $S : B_S \to H$
will be shown in Subsection 5.2. This is strengthened to Hadamard-differentiability
in Subsection 5.3. Finally, it will be shown in Subsection 5.4 that
$\sqrt{n}\,(\mathbb{F}_{D_n} - \iota^{-1}(P))$ converges weakly to a Gaussian process in $\ell_\infty(\mathcal{G})$ and that
this implies asymptotic normality of

$$\sqrt{n}\,\big(f_{L,D_n,\lambda_{D_n}} - f_{L,P,\lambda_0}\big) \quad\text{and}\quad \sqrt{n}\,\big(\mathcal{R}_{L,P}(f_{L,D_n,\lambda_{D_n}}) - \mathcal{R}_{L,P}(f_{L,P,\lambda_0})\big)$$

by applying a functional delta-method.

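Lemma 5.1 Let $(F_n)_{n\in\mathbb{N}} \subset B_S$ be a sequence which converges to some
$F_0 \in B_S$. Then, $\lim_{n\to\infty} \iota(F_n)(\mathcal{X}\times\mathcal{Y}) = \iota(F_0)(\mathcal{X}\times\mathcal{Y})$ and the sequence of
finite measures $\iota(F_n)$, $n \in \mathbb{N}$, converges weakly to $\iota(F_0)$.

Proof: Define $M_n := \iota(F_n)(\mathcal{X}\times\mathcal{Y})$ for every $n \in \mathbb{N}\cup\{0\}$ and
$a_m = (m,\dots,m) \in \mathbb{R}^{d+1}$ for every $m \in \mathbb{N}$. Then,

$$0 \le |M_n - M_0| = \lim_{m\to\infty} \big|F_n(I_{(-\infty,a_m]}) - F_0(I_{(-\infty,a_m]})\big| \le \|F_n - F_0\|_\infty \longrightarrow 0\,.$$

Therefore, the normalized sequence $\tilde F_n = M_n^{-1} F_n$, $n \in \mathbb{N}\cup\{0\}$, corresponds
to a sequence of probability measures $\iota(\tilde F_n)$ such that

$$\lim_{n\to\infty} \iota(\tilde F_n)\big((-\infty,a] \cap \mathcal{X}\times\mathcal{Y}\big) = \lim_{n\to\infty} \frac{1}{M_n} F_n(I_{(-\infty,a]}) = \frac{1}{M_0} F_0(I_{(-\infty,a]}) = \iota(\tilde F_0)\big((-\infty,a] \cap \mathcal{X}\times\mathcal{Y}\big)$$

for every $a \in \mathbb{R}^{d+1}$. Hence, it follows from the Portmanteau theorem that
the sequence of probability measures $(\iota(\tilde F_n))_{n\in\mathbb{N}}$ converges weakly to $\iota(\tilde F_0)$;
see e.g. (van der Vaart, 1998, Lemma 2.2). Finally, since $M_n \to M_0$, this implies that the
sequence of finite measures $(\iota(F_n))_{n\in\mathbb{N}}$ converges weakly to $\iota(F_0)$. □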

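Lemma 5.2 For every $G \in \mathrm{lin}(B_S)$, there is a unique finite signed measure
$\iota(G) = \mu$ on $(\mathcal{X}\times\mathcal{Y}, \mathfrak{B}(\mathcal{X}\times\mathcal{Y}))$ such that

$$\int g\,d\mu = G(g) \qquad \forall\, g \in \mathcal{G}\,. \tag{26}$$

The map

$$\iota : \mathrm{lin}(B_S) \to \mathrm{ca}(\mathcal{X}\times\mathcal{Y}, \mathfrak{B}(\mathcal{X}\times\mathcal{Y})), \qquad G \mapsto \iota(G)$$

defined by (26) is linear. Let $G \in \mathrm{lin}(B_S)$ and $\mu = \iota(G)$. Then,

$$b \in L_2(|\mu|)\,, \qquad b'_a \in L_2(|\mu|) \quad \forall\, a \in (0,\infty)$$

and $L'_f\Phi$ and $L''_f h\Phi$ are Bochner-integrable with respect to $\mu$ for every $f, h \in H$.
Furthermore,

$$A_f : C(\mathcal{X}) \to H, \quad h \mapsto \int L''_f h\Phi\,d\mu\,,$$

$$A_f : H \to H, \quad h \mapsto \int L''_f h\Phi\,d\mu$$

are continuous linear operators for every $f \in H$.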

Proof: For every $G \in \mathrm{lin}(B_S)$, there are $F_1, F_2 \in B_S$ such that $G = F_1 - F_2$.
Define $\mu := \iota(F_1) - \iota(F_2)$. Then, $\mu$ fulfills (26). From the definition of $B_S$
and

$$|\mu|(C) \le \iota(F_1)(C) + \iota(F_2)(C) \qquad \forall\, C \in \mathfrak{B}(\mathcal{X}\times\mathcal{Y})$$

it follows that $b, b'_a \in L_2(|\mu|)$ for every $a \in (0,\infty)$. Next, fix any $f \in H$
and define $a = \|f\|_\infty < \infty$; see (19). Then,

$$\int \|L'_f\Phi\|_H\,d|\mu| \le \|k\|_\infty \int |L'_f|\,d|\mu| \le \|k\|_\infty \int b'_a\,d|\mu| < \infty$$

and, therefore, $L'_f\Phi$ is Bochner-integrable; see e.g. (Denkowski et al., 2003,
Theorem 3.10.3 and Theorem 3.10.9). A similar calculation shows that $L''_f h\Phi$
is Bochner-integrable, too.

In order to prove uniqueness of $\mu$, let $\mu_1$ and $\mu_2$ be finite signed measures such
that $\int g\,d\mu_1 = \int g\,d\mu_2$ for every $g \in \mathcal{G}$. From this equation it follows that
$\int g\,d(\mu_1^+ + \mu_2^-) = \int g\,d(\mu_2^+ + \mu_1^-)$ for every $g \in \mathcal{G}$. Since $\mu_1^+ + \mu_2^-$ and $\mu_2^+ + \mu_1^-$
are finite (positive) measures and $\mathcal{G}$ contains all indicator functions $I_{(-\infty,z]}$,
$z \in \mathbb{R}^{d+1}$, it follows from the uniqueness theorem (e.g. (Hoffmann-Jørgensen,
1994, § 1.7)) that $\mu_1^+ + \mu_2^- = \mu_2^+ + \mu_1^-$. Hence, $\mu_1 = \mu_2$.
Uniqueness and (26) imply linearity of the map $\iota$.

Now let us turn to $A_f$ for any fixed $f \in H$. Obviously, $A_f$ is linear. In
order to prove that $A_f$ is a continuous linear operator, define $a := \|f\|_\infty$,
which is a finite number due to (19). Then, by the assumptions on $L''$,

$$\|A_f(h)\|_H \le \int \|L''_f h\Phi\|_H\,d|\mu| \le \|h\|_\infty \|k\|_\infty \int |b''_a|\,|\mu|(d(x,y)) < \infty\,.$$

According to (Steinwart and Christmann, 2008, Lemma 4.23), the canonical
embedding $H \to C(\mathcal{X})$ is a continuous linear operator. Hence, it also follows
that $A_f : H \to H$ is a continuous linear operator. □

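Lemma 5.3 Let $(\mu_n)_{n\in\mathbb{N}}$ be a tight sequence of finite signed measures on
$(\mathcal{X}\times\mathcal{Y}, \mathfrak{B}(\mathcal{X}\times\mathcal{Y}))$ such that $\sup_{n\in\mathbb{N}} |\mu_n|(\mathcal{X}\times\mathcal{Y}) < \infty$. Let $(f_n)_{n\in\mathbb{N}} \subset H$
be a sequence converging to some $f_0 \in H$. Then,

$$\lim_{n\to\infty}\ \sup_{\|h\|_H \le 1} \Big\| \int L''_{f_n} h\Phi\,d\mu_n - \int L''_{f_0} h\Phi\,d\mu_n \Big\|_H = 0\,.$$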

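Proof: For every $\varepsilon > 0$ there is a compact subset $Z_\varepsilon \subset \mathcal{X}\times\mathcal{Y}$ such that

$$|\mu_n|(\mathcal{X}\times\mathcal{Y} \setminus Z_\varepsilon) < \varepsilon \qquad \forall\, n \in \mathbb{N}\,. \tag{27}$$

Define $a := \sup_{n\in\mathbb{N}_0} \|f_n\|_\infty \le \|k\|_\infty \sup_{n\in\mathbb{N}_0} \|f_n\|_H < \infty$. For every $n \in \mathbb{N}$,

$$\sup_{\|h\|_H\le 1} \Big\| \int L''_{f_n} h\Phi\,d\mu_n - \int L''_{f_0} h\Phi\,d\mu_n \Big\|_H = \sup_{\|h\|_H\le 1} \Big\| \int \big(L''_{f_n} - L''_{f_0}\big) h\Phi\,d\mu_n \Big\|_H$$

$$\le \sup_{\|h\|_H\le 1} \int \big|L''_{f_n}(x,y) - L''_{f_0}(x,y)\big| \cdot \|h\|_\infty \cdot \|\Phi(x)\|_H\ |\mu_n|(d(x,y))$$

$$\le \|k\|_\infty^2 \int \big|L''_{f_n}(x,y) - L''_{f_0}(x,y)\big|\ |\mu_n|(d(x,y))$$

$$\le \|k\|_\infty^2 \int_{Z_\varepsilon} \big|L''_{f_n}(x,y) - L''_{f_0}(x,y)\big|\ |\mu_n|(d(x,y)) + 2\|k\|_\infty^2\, b''_a\, \varepsilon$$

$$\le \|k\|_\infty^2\, |\mu_n|(\mathcal{X}\times\mathcal{Y}) \sup_{(x,y)\in Z_\varepsilon} \big|L''_{f_n}(x,y) - L''_{f_0}(x,y)\big| + 2\|k\|_\infty^2\, b''_a\, \varepsilon\,,$$

where the second to last inequality follows from (27). Since $\sup_{n\in\mathbb{N}} |\mu_n|(\mathcal{X}\times\mathcal{Y}) < \infty$ and $\varepsilon > 0$ can be chosen arbitrarily small,
it only remains to prove that

$$\lim_{n\to\infty}\ \sup_{(x,y)\in Z_\varepsilon} \big|L''_{f_n}(x,y) - L''_{f_0}(x,y)\big| = 0\,. \tag{28}$$

Continuity of $L''$ and compactness of $Z_\varepsilon \times [-a,a]$ imply that $L''$ is uniformly
continuous on $Z_\varepsilon \times [-a,a]$. Assertion (28) is an easy consequence of uniform
continuity of $L''$ on $Z_\varepsilon \times [-a,a]$, the inequality $-a \le f_n \le a$ for every $n \in \mathbb{N}_0$,
and the fact that $\lim_n \|f_n - f_0\|_H = 0$ implies $\lim_n \|f_n - f_0\|_\infty = 0$. □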
5.2 Gateaux-Differentiability of the SVM-Functional

In this subsection, it will be shown that the SVM-functional

$$S : B_S \to H, \qquad F \mapsto f_{\iota(F)}$$

is Gateaux-differentiable. Essentially, this is already known because
Christmann and Steinwart (2004, 2007) derive the influence function of $S$,
which is a (special) Gateaux-derivative. Therefore, the proofs in this
subsection can essentially be adapted from Christmann and Steinwart (2004,
2007) and (Steinwart and Christmann, 2008, § 10.4). However, some care is
needed as we also have to deal with signed measures and with a (random)
sequence of regularization parameters $\lambda_{D_n}$ instead of a fixed $\lambda_0$; see also
Remark 3.7.

At first, we have to show Frechet-differentiability of the "generalized risk"
$\mathcal{R}_{L,\mu} : f \mapsto \int L_f\,d\mu$ (and of its derivative) for finite signed measures $\mu$.
If $\mu$ is a probability measure, then Lemma 5.4(a) is just the well-known
Frechet-differentiability of the ordinary risk $\mathcal{R}_{L,P}$.



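Lemma 5.4 For every finite signed measure $\mu$ on $(\mathcal{X}\times\mathcal{Y}, \mathfrak{B}(\mathcal{X}\times\mathcal{Y}))$ such
that

$$\int b\,d|\mu| < \infty \quad\text{and}\quad \int b'_a\,d|\mu| < \infty \qquad \forall\, a \in (0,\infty)\,, \tag{29}$$

the following statements are true:

(a) The map

$$H \to \mathbb{R}, \qquad f \mapsto \int L_f\,d\mu$$

is Frechet-differentiable and its Frechet-derivative in $f \in H$ is given
by $H \to \mathbb{R}$, $h \mapsto \big\langle \int L'_f\Phi\,d\mu,\, h \big\rangle_H$.

(b) The map

$$H \to H, \qquad f \mapsto \int L'_f\Phi\,d\mu$$

is Frechet-differentiable and its Frechet-derivative in $f \in H$ is given
by $H \to H$, $h \mapsto \int L''_f h\Phi\,d\mu$.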

Proof: Both statements can be proven essentially by following the lines of
(Steinwart and Christmann, 2008, Lemma 2.21). Since the proofs of (a) and
(b) nearly coincide, only the proof of (b) is given in detail.
Define

$$T(f) = \int L'_f\Phi\,d\mu \quad\text{and}\quad T'_f(h) = \int L''_f h\Phi\,d\mu$$

for every $f, h \in H$. Lemma 5.2 guarantees that these Bochner-integrals
exist and that $T'_f : H \to H$, $h \mapsto T'_f(h)$ is a continuous linear operator. Now,
fix any $f \in H$ and let $(h_n)_{n\in\mathbb{N}} \subset H \setminus \{0\}$ be a sequence which converges to
$0$ in $H$. Define

$$\gamma_n(x,y) := \frac{\big|L'(x,y,f(x)+h_n(x)) - L'(x,y,f(x)) - h_n(x)L''(x,y,f(x))\big|}{|h_n(x)|}$$

for every $(x,y) \in \mathcal{X}\times\mathcal{Y}$ such that $h_n(x) \neq 0$ and $\gamma_n(x,y) = 0$ for every
$(x,y) \in \mathcal{X}\times\mathcal{Y}$ such that $h_n(x) = 0$. The maps $\gamma_n : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}$, $(x,y) \mapsto
\gamma_n(x,y)$, $n \in \mathbb{N}$, are measurable. Since $H$ is a RKHS, $\lim_{n\to\infty} h_n(x) = 0$
for every $x \in \mathcal{X}$. Therefore, the definition of $L''$ as a partial derivative of $L'$
implies

$$\lim_{n\to\infty} \gamma_n(x,y) = 0 \qquad \forall\, (x,y) \in \mathcal{X}\times\mathcal{Y}\,. \tag{30}$$

Define $a := \|f\|_\infty + \sup_{n\in\mathbb{N}} \|h_n\|_\infty \le \|k\|_\infty \big(\|f\|_H + \sup_{n\in\mathbb{N}} \|h_n\|_H\big) < \infty$, where the inequality follows from (19).
Then, by use of the elementary mean value theorem,

$$|\gamma_n(x,y)| \le \frac{\big|L'(x,y,f(x)+h_n(x)) - L'(x,y,f(x))\big|}{|h_n(x)|} + \big|L''(x,y,f(x))\big| \le 2\,b''_a$$

for every $(x,y)$ such that $h_n(x) \neq 0$ and every $n \in \mathbb{N}$. Hence, we can use
the dominated convergence theorem (e.g. (Dudley, 2002, Theorem 4.3.5))
in order to finish the proof:

$$\lim_{n\to\infty} \frac{\|T(f+h_n) - T(f) - T'_f(h_n)\|_H}{\|h_n\|_H} \le \lim_{n\to\infty} \int \frac{|h_n(x)|}{\|h_n\|_H} \cdot |\gamma_n(x,y)| \cdot \|\Phi(x)\|_H\ |\mu|(d(x,y))$$

$$\le \lim_{n\to\infty} \|k\|_\infty^2 \int |\gamma_n(x,y)|\ |\mu|(d(x,y)) = 0\,.$$

□

Lemma 5.5 For every $F \in B_S$,

$$K_F : H \to H, \qquad f \mapsto 2\lambda_0 f + \int L''_{f_{\iota(F)}} f\Phi\,d[\iota(F)]$$

is a continuous linear operator which is invertible.
Proof: It follows from Lemma 5.2 that $K_F$ is a continuous linear operator
and it only remains to prove that $K_F$ is invertible. This is done by use of the
Fredholm alternative (see e.g. (Griffel, 2002, Theorem 9.29)). The following
proof is essentially a variant of the proof of (Steinwart and Christmann,
2008, Theorem 10.18). We have to show:

(i) $K_F$ is injective.

(ii) $A := A_{f_{\iota(F)}}$ as defined in Lemma 5.2 is a compact operator.

Define $\mu = \iota(F)$. In order to prove (i), fix any $f \in H \setminus \{0\}$ and note that
convexity of $L$ implies $L''_{f_{\iota(F)}} \ge 0$. Therefore,

$$\|K_F(f)\|_H^2 = \big\langle 2\lambda_0 f + A(f),\, 2\lambda_0 f + A(f) \big\rangle_H = 4\lambda_0^2 \|f\|_H^2 + 4\lambda_0 \langle f, A(f)\rangle_H + \|A(f)\|_H^2 > 4\lambda_0 \langle f, A(f)\rangle_H$$

$$= 4\lambda_0 \Big\langle f,\, \int L''_{f_{\iota(F)}} f\Phi\,d\mu \Big\rangle_H = 4\lambda_0 \int L''_{f_{\iota(F)}} f\,\langle f, \Phi\rangle_H\,d\mu = 4\lambda_0 \int L''_{f_{\iota(F)}} f^2\,d\mu \ge 0\,,$$

where the interchange of inner product and integral follows from (25).
In the following, (ii) will be shown. To this end, let $M \subset H$ be a
(norm-)bounded subset of $H$. Since $\mathcal{X}$ is compact, it follows from (Steinwart and Christmann,
2008, Corollary 4.31) that $M$ is a relatively compact subset of $C(\mathcal{X})$ (with
respect to the norm-topology of $C(\mathcal{X})$). In order to prove compactness of $A$,
we have to show that every sequence $(A(f_j))_{j\in\mathbb{N}} \subset \{A(f) \mid f \in M\}$ contains
a convergent subsequence. Relative compactness of $M$ (in $C(\mathcal{X})$) implies
that there is a subsequence $(f_{j_\ell})_{\ell\in\mathbb{N}} \subset (f_j)_{j\in\mathbb{N}}$ which is a Cauchy-sequence
in $C(\mathcal{X})$. Since $A_{f_{\iota(F)}}$ is a continuous linear operator on $C(\mathcal{X})$ (Lemma 5.2),
this implies that the sequence

$$A(f_{j_\ell}) = A_{f_{\iota(F)}}(f_{j_\ell})\,, \qquad \ell \in \mathbb{N}\,,$$

is a Cauchy-sequence in $H$. Hence, $(A(f_{j_\ell}))_{\ell\in\mathbb{N}}$ converges in $H$ since $H$ is
complete. □

By use of these preliminary lemmas, Gateaux-differentiability of the SVM- 
functional can be shown now: 

Proposition 5.6 Let $F \in B_S$, $G \in \ell_\infty(\mathcal{G})$ and $\rho > 0$ such that $F + sG \in B_S$
for every $s \in (-\rho,\rho)$. Then, there is a unique finite signed measure $\mu$ such
that

$$\int g\,d\mu = G(g) \qquad \forall\, g \in \mathcal{G}\,. \tag{31}$$

Furthermore,

$$\lim_{s\to 0} \Big\| \frac{S(F+sG) - S(F)}{s} - S'_F(G) \Big\|_H = 0$$

where

$$S'_F(G) := -K_F^{-1}\Big( \int L'_{f_{\iota(F)}}\Phi\,d\mu \Big)\,. \tag{32}$$

In particular, $S$ is Gateaux-differentiable.

Proof: The following proof is similar to the proof of (Steinwart and Christmann,
2008, Theorem 10.18) but some care is needed because we also have to deal
with signed measures here.

Part 1: Define $\nu := \iota(F)$. Since $G = s^{-1}\big((F + sG) - F\big) \in \mathrm{lin}(B_S)$ for any
$s \in (-\rho,\rho) \setminus \{0\}$, it follows from Lemma 5.2 that there is a unique finite
signed measure $\mu$ such that

$$\int g\,d\mu = G(g) \qquad \forall\, g \in \mathcal{G}\,. \tag{33}$$

Define

$$\Gamma : \mathbb{R}\times H \to H, \qquad (s,f) \mapsto 2\lambda_0 f + \int L'_f\Phi\,d\nu + s\int L'_f\Phi\,d\mu\,.$$

Lemma 5.4(b) implies that the maps $H \to H$, $f \mapsto \int L'_f\Phi\,d\nu$ and $H \to
H$, $f \mapsto \int L'_f\Phi\,d\mu$ are continuous. Hence, an easy calculation shows that
$\Gamma$ is continuous.

Part 2: In this part, it will be shown that $\Gamma$ is continuously Frechet-differentiable.
First, it follows from Lemma 5.4(b) that the map

$$\mathbb{R}\times H \to H, \qquad (s,f) \mapsto \frac{\partial\Gamma}{\partial\mathbb{R}}(s,f) = \int L'_f\Phi\,d\mu$$

is continuous. Secondly, Lemma 5.4(b) yields that the partial derivative
$\frac{\partial\Gamma}{\partial H}(s,f)$ is given by

$$\frac{\partial\Gamma}{\partial H}(s,f) : H \to H, \qquad h \mapsto 2\lambda_0 h + \int L''_f h\Phi\,d\nu + s\int L''_f h\Phi\,d\mu$$

for every $(s,f) \in \mathbb{R}\times H$. Let $\mathcal{B}(H,H)$ be the set of all continuous linear
operators $T : H \to H$; this is a Banach space with the operator norm. It
follows from Lemma 5.3 that

$$\mathbb{R}\times H \to \mathcal{B}(H,H), \qquad (s,f) \mapsto \frac{\partial\Gamma}{\partial H}(s,f)$$

is continuous. Since $\Gamma$ is continuous (as stated above), this implies that $\Gamma$ is
continuously Frechet-differentiable according to (Denkowski et al., 2003, p.
635).

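Part 3: Now, we can prove the statement of the proposition by use of an implicit
function theorem. It follows from Lemma 5.4(a) that

$$\Gamma(s,f) = \frac{\partial \mathcal{R}_{L,\nu+s\mu,\lambda_0}}{\partial H}(f) \qquad \forall\, f \in H \quad \forall\, s \in (-\rho,\rho)\,. \tag{34}$$

Since $H \to \mathbb{R}$, $f \mapsto \mathcal{R}_{L,\nu+s\mu,\lambda_0}(f)$ is strictly convex and continuously Frechet-
differentiable, the following assertion is valid for every $s \in (-\rho,\rho)$:

$$\Gamma(s,f) = 0 \quad\Longleftrightarrow\quad f = f_{\nu+s\mu}\,. \tag{35}$$

(Direction "$\Leftarrow$" follows from (Luenberger, 1969, Theorem 7.4.1) and "$\Rightarrow$"
follows from (Luenberger, 1969, Lemma 8.7.1) and uniqueness of the
minimizer.) As shown in Part 2, $\Gamma$ is continuously Frechet-differentiable.
According to Lemma 5.5,

$$\frac{\partial\Gamma}{\partial H}(0, f_\nu) = K_F$$

is an invertible operator. Therefore, it follows from a classical implicit
function theorem (e.g. (Akerkar, 1999, §4)) that there is a $\delta \in (0,\rho)$ and a
Frechet-differentiable map $\varphi : (-\delta,\delta) \to H$ such that

$$\Gamma(s, \varphi(s)) = 0 \qquad \forall\, s \in (-\delta,\delta) \tag{36}$$

and the derivative is equal to

$$\varphi'(0) = -\Big(\frac{\partial\Gamma}{\partial H}(0,\varphi(0))\Big)^{-1}\Big(\frac{\partial\Gamma}{\partial\mathbb{R}}(0,\varphi(0))\Big) = -K_F^{-1}\Big(\int L'_{f_\nu}\Phi\,d\mu\Big)\,.$$

According to (35) and (36), $\varphi(s) = f_{\nu+s\mu} = S(F+sG)$ for every $s \in (-\delta,\delta)$.
Define $S'_F(G) = \varphi'(0)$. Hence,

$$\lim_{s\to 0} \Big\| \frac{S(F+sG) - S(F)}{s} - S'_F(G) \Big\|_H = \lim_{s\to 0} \Big\| \frac{\varphi(s) - \varphi(0)}{s} - \varphi'(0) \Big\|_H = 0\,.$$

□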
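The derivative formula (32) can be sanity-checked by finite differences in a toy setting. The sketch below is an illustration only, under assumptions that are not part of the paper: least-squares loss $L(x,y,t) = (y-t)^2$ (so $L' = -2(y-t)$ and $L'' = 2$), the RKHS of constant functions (kernel $k \equiv 1$, $\Phi \equiv 1$), and discrete measures $\nu = \iota(F)$ and $\mu = \iota(G)$ on common support points. Under these assumptions $K_F$ is multiplication by $2\lambda_0 + 2\nu(\mathcal{X}\times\mathcal{Y})$. The names `svm_constant` and `gateaux_derivative` are hypothetical.

```python
def svm_constant(weights, ys, lam):
    """Minimizer of sum_i w_i*(y_i - f)^2 + lam*f^2 over constants f."""
    num = sum(w * y for w, y in zip(weights, ys))
    return num / (sum(weights) + lam)


def gateaux_derivative(nu, mu, ys, lam):
    """-K_F^{-1}( int L'_f Phi d mu ) for L(y,t) = (y-t)^2 and k = 1."""
    f = svm_constant(nu, ys, lam)
    # K_F = 2*lam + int L''_f d nu, with L'' = 2 for the least-squares loss.
    k_f = 2.0 * lam + 2.0 * sum(nu)
    # int L'_f Phi d mu, with L'_f(x, y) = -2*(y - f) and Phi = 1.
    integral = sum(r * (-2.0) * (y - f) for r, y in zip(mu, ys))
    return -integral / k_f


ys = [1.0, -1.0, 3.0]
nu = [0.5, 2.0, 1.5]    # positive weights: the finite measure iota(F)
mu = [0.3, -0.4, 0.2]   # signed weights: the direction iota(G)
lam = 0.7

s = 1e-6                # small step; nu + s*mu stays a positive measure
perturbed = [v + s * r for v, r in zip(nu, mu)]
finite_diff = (svm_constant(perturbed, ys, lam) - svm_constant(nu, ys, lam)) / s
exact = gateaux_derivative(nu, mu, ys, lam)

print(finite_diff, exact)  # the difference quotient is close to the formula
```

The difference quotient and the closed-form expression agree up to the discretization error of the finite difference, which is what Gateaux-differentiability in direction $G$ asserts in this simple case.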



5.3 Hadamard-Differentiability of the SVM-Functional 

In this subsection, the result of the previous Subsection 5.2 is strengthened.
In statistics, three different types of differentiability in Banach spaces are
particularly important: Gateaux-differentiability, Hadamard-differentiability
and Frechet-differentiability. Among these, Gateaux is the weakest and
Frechet is the strongest notion of differentiability. In order to apply the
functional delta-method, we need the intermediate Hadamard-differentiability.
It is well-known that a Gateaux-differentiable function is even Frechet-
differentiable (and, therefore, Hadamard-differentiable) if the (Gateaux-)derivative
is continuous. In the following Lemma 5.7 it will be shown that the
Gateaux-derivative of $S$ fulfills a certain continuity property (38). This
property is not strong enough to guarantee Frechet-differentiability.
However, it will be shown in the proof of Theorem 5.8 that it is just strong
enough to guarantee Hadamard-differentiability of $S$ tangentially
to the closed linear span of $B_S$. In order to do this, we only have to slightly
change the proof of the well-known interrelationship between Gateaux- and
Frechet-differentiability (as provided, e.g., by (Denkowski et al., 2003, Prop.
5.1.8)).

Lemma 5.7 Let $B_0 = \mathrm{cl}(\mathrm{lin}(B_S))$ be the closed linear span of $B_S$ in $\ell_\infty(\mathcal{G})$.
Let $(G_n)_{n\in\mathbb{N}} \subset \mathrm{lin}(B_S)$ be a sequence such that $\lim_{n\to\infty} \|G_n - G_0\|_\infty = 0$
for some $G_0 \in \ell_\infty(\mathcal{G})$ and let $(F_n)_{n\in\mathbb{N}} \subset B_S$ be a sequence such that
$\lim_{n\to\infty} \|F_n - F_0\|_\infty = 0$ for some $F_0 \in B_S$ which fulfills

$$F_0(b) < \int b\,dP + \lambda_0\,. \tag{37}$$

Then, there is an $n_0 \in \mathbb{N}$ such that, for every $F \in \{F_n \mid n \in \mathbb{N}_{\ge n_0}\} \cup \{F_0\}$,
the map $S'_F : G \mapsto S'_F(G)$ defined in Proposition 5.6 can be extended to a
continuous linear operator $S'_F : B_0 \to H$. In addition,

$$\lim_{n\to\infty} \big\| S'_{F_n}(G_n) - S'_{F_0}(G_0) \big\|_H = 0\,. \tag{38}$$
Proof: The proof consists of four parts:

Part 1: Fix any $F \in B_S$ such that $\|f_{\iota(F)}\|_H \le c_0$ where $c_0$ is defined as in
(21). That is,

$$L'_{f_{\iota(F)}} f \in \mathcal{G} \qquad \forall\, f \in H \text{ with } \|f\|_H \le 1\,. \tag{39}$$

According to Lemma 5.2, the map $S'_F : G \mapsto S'_F(G)$ defined in Proposition
5.6 can be extended to the map

$$S'_F : \mathrm{lin}(B_S) \to H, \qquad G \mapsto -K_F^{-1}\Big( \int L'_{f_{\iota(F)}}\Phi\,d[\iota(G)] \Big)\,.$$

Since $\iota$ is linear according to Lemma 5.2, this map is linear. In order to
prove that $S'_F$ is a continuous linear operator on $\mathrm{lin}(B_S)$, it is enough to
show that

$$W_F : \mathrm{lin}(B_S) \to H, \qquad G \mapsto \int L'_{f_{\iota(F)}}\Phi\,d[\iota(G)]$$

is a continuous linear operator because $K_F^{-1}$ is a continuous linear operator
according to Lemma 5.5. To this end, note that for every $G \in \mathrm{lin}(B_S)$ and
every $f \in H$ such that $\|f\|_H \le 1$, it follows from (25), (26) and (39) that

$$\langle W_F(G), f\rangle_H = \Big\langle \int L'_{f_{\iota(F)}}\Phi\,d[\iota(G)],\, f \Big\rangle_H = \int L'_{f_{\iota(F)}} f\,d[\iota(G)] = G\big(L'_{f_{\iota(F)}} f\big)\,.$$

That is, for every $f \in H$ such that $\|f\|_H \le 1$,

$$\langle W_F(G), f\rangle_H = G\big(L'_{f_{\iota(F)}} f\big) \qquad \forall\, G \in \mathrm{lin}(B_S)\,. \tag{40}$$

Hence, by (40),

$$\|W_F(G)\|_H = \sup_{\|f\|_H\le 1} \langle W_F(G), f\rangle_H = \sup_{\|f\|_H\le 1} G\big(L'_{f_{\iota(F)}} f\big) \le \|G\|_\infty$$

and, therefore, $W_F$ is a continuous linear operator with operator norm
$\|W_F\| \le 1$.

Since $\mathrm{lin}(B_S)$ is dense in $B_0$, $W_F$ can be extended to a continuous linear
operator $W_F : B_0 \to H$ with operator norm

$$\|W_F\| \le 1\,, \tag{41}$$

see e.g. (Megginson, 1998, Theorem 1.9.1). Hence, $S'_F$ can be extended to
the continuous linear map

$$S'_F : B_0 \to H, \qquad G \mapsto -K_F^{-1}\big(W_F(G)\big)$$

on $B_0 = \mathrm{cl}(\mathrm{lin}(B_S))$. In particular, the latter is eventually true for $F = F_n$
because it follows from $\lim_{n\to\infty} \|F_n - F_0\|_\infty = 0$, $b \in \mathcal{G}$, (21), (23) and (37)
that there is some $n_0 \in \mathbb{N}$ such that

$$\|f_{\iota(F_n)}\|_H \le c_0 \qquad \forall\, n \in \mathbb{N}_{\ge n_0} \cup \{0\}$$

and, therefore, $F = F_n$ fulfills (39) for every $n \in \mathbb{N}_{\ge n_0} \cup \{0\}$.
In addition, note that, for every $G \in B_0$, there is a sequence $G_n \in \mathrm{lin}(B_S)$,
$n \in \mathbb{N}$, which converges to $G$ and, therefore, by (40),

$$\langle W_{F_0}(G), f\rangle_H = \lim_{n\to\infty} \langle W_{F_0}(G_n), f\rangle_H = \lim_{n\to\infty} G_n\big(L'_{f_{\iota(F_0)}} f\big) = G\big(L'_{f_{\iota(F_0)}} f\big)$$

for every $f \in H$ such that $\|f\|_H \le 1$. As $K_{F_0}$ is invertible, $S'_{F_0}(G) = 0$ if
and only if $W_{F_0}(G) = 0$. Summing up, we may record for later purposes
(Proposition 5.11) that, for every $G \in B_0$,

$$S'_{F_0}(G) = 0 \quad\Longleftrightarrow\quad G\big(L'_{f_{\iota(F_0)}} f\big) = 0 \quad \forall\, f \in H \text{ such that } \|f\|_H \le 1\,. \tag{42}$$
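Part 2: In this part of the proof, it will be shown that

$$K_{F_n}^{-1} \longrightarrow K_{F_0}^{-1} \quad \text{in the operator norm}\,. \tag{43}$$

To this end, it suffices to show that

$$K_{F_n} \longrightarrow K_{F_0} \quad \text{in the operator norm}$$

according to (Dunford and Schwartz, 1958, Lemma VII.6.1). Because of

$$\|K_{F_n}(f) - K_{F_0}(f)\|_H \le \Big\| \int L''_{f_{\iota(F_n)}} f\Phi\,d[\iota(F_n)] - \int L''_{f_{\iota(F_0)}} f\Phi\,d[\iota(F_n)] \Big\|_H + \Big\| \int L''_{f_{\iota(F_0)}} f\Phi\,d[\iota(F_n)] - \int L''_{f_{\iota(F_0)}} f\Phi\,d[\iota(F_0)] \Big\|_H\,,$$

this can be done by showing

$$\lim_{n\to\infty}\ \sup_{\|f\|_H\le 1} \Big\| \int L''_{f_{\iota(F_n)}} f\Phi\,d[\iota(F_n)] - \int L''_{f_{\iota(F_0)}} f\Phi\,d[\iota(F_n)] \Big\|_H = 0 \tag{44}$$

and

$$\lim_{n\to\infty}\ \sup_{\|f\|_H\le 1} \Big\| \int L''_{f_{\iota(F_0)}} f\Phi\,d[\iota(F_n)] - \int L''_{f_{\iota(F_0)}} f\Phi\,d[\iota(F_0)] \Big\|_H = 0\,. \tag{45}$$

In order to prove (44), define

$$\tilde F_n := \frac{1}{\iota(F_n)(\mathcal{X}\times\mathcal{Y})}\, F_n \quad\text{and}\quad \lambda_n := \frac{\lambda_0}{\iota(F_n)(\mathcal{X}\times\mathcal{Y})} \qquad \forall\, n \in \mathbb{N}\cup\{0\}\,,$$

and

$$\hat F_n := \frac{\iota(F_n)(\mathcal{X}\times\mathcal{Y})}{\iota(F_0)(\mathcal{X}\times\mathcal{Y})}\, F_0 \qquad \forall\, n \in \mathbb{N}\cup\{0\}\,.$$

Then, $\iota(\tilde F_n)$ is a probability measure and, according to Lemma 5.1, it follows
that $\lim_{n\to\infty} \iota(F_n)(\mathcal{X}\times\mathcal{Y}) = \iota(F_0)(\mathcal{X}\times\mathcal{Y})$ and, therefore, $\lim_{n\to\infty} \|\hat F_n - F_0\|_\infty = 0$. Hence,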



L> $ d[ t (F n ] - / L' f $ d[t(F ] 

•'i,t(Fg),A n / JL,t(F ),A n 



(*) 1 

< lim — 

n-»oo \ 

An 



1 



/i, i (F , n ),A 1 ^ 



lim 

n— ¥oo \ 

^M\ w h^)- w h,s^)\\H 



11 

^d[i(F ] 



11 

(46) 



where (*) follows from (Steinwart and Christmann, 2008, Theorem 5.9 and
Corollary 5.19).

Since lim^oo i{F n )[X x y) = l(F )(X x y), it follows from (JUJ), ([23]) and 
(|3Tjl that 

Ik«.(F )IIh — c ° for large enough n£l. 

Hence, 



l|/^n)-A(F0)L < l|WX n (^»)-W^ n (fb)|| ff < 



SD i , . 

< lim -=— _F n — F 

n— >oo \ 







(47) 



Therefore, (44) follows from Lemma 5.3.

In order to prove (45), define $M := \sup_{n\in\mathbb{N}\cup\{0\}} \iota(F_n)(\mathcal{X}\times\mathcal{Y}) < \infty$ (see
Lemma 5.1) and note that, according to (Steinwart and Christmann, 2008,
Corollary 4.31),

$$\mathcal{F}_1 = \{ f \in H \mid \|f\|_H \le 1 \} \subset C(\mathcal{X})$$

can be identified with a relatively compact subset of $C(\mathcal{X})$ (with respect to
the norm-topology of $C(\mathcal{X})$). Hence, for every $\varepsilon > 0$, there is an $m_\varepsilon \in \mathbb{N}$
and functions $f_1, \dots, f_{m_\varepsilon} \in C(\mathcal{X})$ such that

$$\|f_j\|_\infty \le \sup_{f\in\mathcal{F}_1} \|f\|_\infty \le \|k\|_\infty \qquad \forall\, j \in \{1,\dots,m_\varepsilon\}\,, \tag{48}$$

$$\min_{j\in\{1,\dots,m_\varepsilon\}} \|f - f_j\|_\infty < \varepsilon \qquad \forall\, f \in \mathcal{F}_1\,. \tag{49}$$



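Define $a := \|f_{\iota(F_0)}\|_\infty$. Fix any $f \in \mathcal{F}_1$ and take $j_0 \in \{1,\dots,m_\varepsilon\}$ such that
$\|f - f_{j_0}\|_\infty < \varepsilon$. Then,

$$\Big\| \int L''_{f_{\iota(F_0)}} f\Phi\,d[\iota(F_n)] - \int L''_{f_{\iota(F_0)}} f\Phi\,d[\iota(F_0)] \Big\|_H$$

$$= \Big\| \int L''_{f_{\iota(F_0)}} (f - f_{j_0})\Phi\,d[\iota(F_n)] - \int L''_{f_{\iota(F_0)}} (f - f_{j_0})\Phi\,d[\iota(F_0)] - \int L''_{f_{\iota(F_0)}} f_{j_0}\Phi\,d[\iota(F_0)] + \int L''_{f_{\iota(F_0)}} f_{j_0}\Phi\,d[\iota(F_n)] \Big\|_H$$

$$\le \int \big\| L''_{f_{\iota(F_0)}} (f - f_{j_0})\Phi \big\|_H\,d[\iota(F_n)] + \int \big\| L''_{f_{\iota(F_0)}} (f - f_{j_0})\Phi \big\|_H\,d[\iota(F_0)] + \Big\| \int L''_{f_{\iota(F_0)}} f_{j_0}\Phi\,d[\iota(F_0)] - \int L''_{f_{\iota(F_0)}} f_{j_0}\Phi\,d[\iota(F_n)] \Big\|_H$$

$$\le 2\,b''_a\,\|k\|_\infty M \varepsilon + \Big\| \int L''_{f_{\iota(F_0)}} f_{j_0}\Phi\,d[\iota(F_0)] - \int L''_{f_{\iota(F_0)}} f_{j_0}\Phi\,d[\iota(F_n)] \Big\|_H\,.$$

Hence,

$$\sup_{f\in\mathcal{F}_1} \Big\| \int L''_{f_{\iota(F_0)}} f\Phi\,d[\iota(F_n)] - \int L''_{f_{\iota(F_0)}} f\Phi\,d[\iota(F_0)] \Big\|_H \le 2\,b''_a\,\|k\|_\infty M \varepsilon + \max_{j\in\{1,\dots,m_\varepsilon\}} \Big\| \int L''_{f_{\iota(F_0)}} f_j\Phi\,d[\iota(F_n)] - \int L''_{f_{\iota(F_0)}} f_j\Phi\,d[\iota(F_0)] \Big\|_H\,. \tag{50}$$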

Convergence of $(F_n)_{n\in\mathbb{N}}$ in $\ell_\infty(\mathcal{G})$ implies weak convergence (Lemma 5.1)
and, therefore, tightness of the sequence of finite measures $(\iota(F_n))_{n\in\mathbb{N}}$; see
e.g. (Bauer, 2001, Theorem 30.8). Hence, there is a compact set $Z_\varepsilon \subset \mathcal{X}\times\mathcal{Y}$
such that, for its complement $\complement Z_\varepsilon$, we have $\sup_{n\in\mathbb{N}_0} \iota(F_n)(\complement Z_\varepsilon) < \varepsilon$. Then,



max 
je{l,...,m e } 



< max 

j£{l,...,rn e } 



H 



L "u F J^MFn)] 



< max 

j£{l,...,rn e } 



H 



According to (Bourbaki, 2004, p. III.40), weak convergence of the sequence
of finite (positive) measures $(\iota(F_n))_{n\in\mathbb{N}}$ implies



lim 

n— >oo 



L i iF j^ d vm - j L 'k {F j^ d W] 







for every $j \in \{1,\dots,m_\varepsilon\}$. (Since $H$ is a separable Banach space, Pettis
integrals and Bochner-integrals coincide; see e.g. (Dudley, 2002, p. 194f).)
As $\varepsilon > 0$ can be arbitrarily small, (45) follows from (50) and the above
calculation.



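Part 3: In this part of the proof, it will be shown that

$$\lim_{n\to\infty} \big\| W_{F_n}(G_0) - W_{F_0}(G_0) \big\|_H = 0\,. \tag{51}$$

For every $m \in \mathbb{N}$, we have $G_m \in \mathrm{lin}(B_S)$ and, therefore,

$$W_{F_n}(G_m) = \int L'_{f_{\iota(F_n)}}\Phi\,d[\iota(G_m)]$$

for every $n \in \mathbb{N}_0$. Hence, it follows from (47) and Lemma 5.4(b) that

$$\lim_{n\to\infty} \big\| W_{F_n}(G_m) - W_{F_0}(G_m) \big\|_H = 0 \qquad \forall\, m \in \mathbb{N}\,. \tag{52}$$

Furthermore, by (41), we have

$$\lim_{m\to\infty}\ \sup_{n} \big\| W_{F_n}(G_m) - W_{F_n}(G_0) \big\|_H \le \lim_{m\to\infty} \|G_m - G_0\|_\infty = 0\,. \tag{53}$$

According to (Dunford and Schwartz, 1958, I.7.6), (52) and (53) imply

$$\lim_{n\to\infty} \big\| W_{F_n}(G_0) - W_{F_0}(G_0) \big\|_H = \lim_{n\to\infty} \lim_{m\to\infty} \big\| W_{F_n}(G_m) - W_{F_0}(G_m) \big\|_H = \lim_{m\to\infty} \lim_{n\to\infty} \big\| W_{F_n}(G_m) - W_{F_0}(G_m) \big\|_H = 0\,.$$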
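Part 4: By use of the previous parts, we complete the proof by proving (38):

$$\lim_{n\to\infty} \big\| S'_{F_n}(G_n) - S'_{F_0}(G_0) \big\|_H = \lim_{n\to\infty} \big\| K_{F_n}^{-1}(W_{F_n}(G_n)) - K_{F_0}^{-1}(W_{F_0}(G_0)) \big\|_H$$

$$\le \lim_{n\to\infty} \Big( \big\| K_{F_n}^{-1}(W_{F_n}(G_n)) - K_{F_0}^{-1}(W_{F_n}(G_n)) \big\|_H + \big\| K_{F_0}^{-1}(W_{F_n}(G_n)) - K_{F_0}^{-1}(W_{F_n}(G_0)) \big\|_H + \big\| K_{F_0}^{-1}(W_{F_n}(G_0)) - K_{F_0}^{-1}(W_{F_0}(G_0)) \big\|_H \Big)$$

$$\le \lim_{n\to\infty} \Big( \big\| K_{F_n}^{-1} - K_{F_0}^{-1} \big\| \cdot \|W_{F_n}(G_n)\|_H + \big\| K_{F_0}^{-1} \big\| \cdot \|W_{F_n}(G_n) - W_{F_n}(G_0)\|_H \Big)$$

$$\le \lim_{n\to\infty} \Big( \big\| K_{F_n}^{-1} - K_{F_0}^{-1} \big\| \cdot \|G_n\|_\infty + \big\| K_{F_0}^{-1} \big\| \cdot \|G_n - G_0\|_\infty \Big) = 0\,,$$

where the second inequality uses (51), the third uses (41), and the final equality uses (43). □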



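Theorem 5.8 For every $F_0 \in B_S$ which fulfills (37), the map

$$S : B_S \to H, \qquad F \mapsto f_{\iota(F)}$$

is Hadamard-differentiable in $F_0$ tangentially to the closed linear span $B_0 =
\mathrm{cl}(\mathrm{lin}(B_S))$. The derivative in $F_0$ is a continuous linear operator $S'_{F_0} :
B_0 \to H$ such that

$$S'_{F_0}(G) = -K_{F_0}^{-1}\Big( \int L'_{f_{\iota(F_0)}}\Phi\,d[\iota(G)] \Big) \qquad \forall\, G \in \mathrm{lin}(B_S)\,. \tag{54}$$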

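Proof: Let $(G_n)_{n\in\mathbb{N}} \subset \ell_\infty(\mathcal{G})$ and $(t_n)_{n\in\mathbb{N}} \subset \mathbb{R}\setminus\{0\}$ be sequences such that
$\lim_{n\to\infty} \|G_n - G_0\|_\infty = 0$ for some $G_0 \in \ell_\infty(\mathcal{G})$, such that $t_n \searrow 0$, and such
that $F_n := F_0 + t_n G_n \in B_S$ for every $n \in \mathbb{N}$. Then, $\lim_{n\to\infty} \|F_n - F_0\|_\infty = 0$
and $G_n \in \mathrm{lin}(B_S)$ for every $n \in \mathbb{N}$. According to Lemma 5.7, there is an
$n_0 \in \mathbb{N}$ such that, for every $F \in \{F_n \mid n \in \mathbb{N}_{\ge n_0}\} \cup \{F_0\}$, there is a continuous
linear operator $S'_F : B_0 \to H$ which fulfills (54). We have to show

$$\lim_{n\to\infty} \Big\| \frac{S(F_0 + t_n G_n) - S(F_0)}{t_n} - S'_{F_0}(G_0) \Big\|_H = 0\,. \tag{55}$$

Note that the assumptions imply $G_0 \in B_0$. Define

$$h_n := S(F_0 + t_n G_n) - S(F_0) - t_n S'_{F_0}(G_0) \qquad \forall\, n \in \mathbb{N}\,. \tag{56}$$

That is, for every $f \in H$,

$$\langle f, h_n\rangle_H = \big\langle f,\, S(F_0 + t_n G_n) - S(F_0) \big\rangle_H - \big\langle f,\, t_n S'_{F_0}(G_0) \big\rangle_H\,. \tag{57}$$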
In order to prove for every $n \in \mathbb{N}$ that the function

$$[0,1] \to H, \qquad s \mapsto S(F_0 + s t_n G_n)$$

is well-defined, we have to show that $F_0 + s t_n G_n \in B_S$ for every $s \in [0,1]$.
It follows from $F_n \in B_S$ that $G_n \in \mathrm{lin}(B_S)$. Therefore, there is a finite
signed measure $\mu_{n,s}$ such that $\mu_{n,s} = \iota(F_0 + s t_n G_n)$ and $F_0 + s t_n G_n \in
\mathrm{lin}(B_S)$. Take any $A \in \mathfrak{B}(\mathcal{X}\times\mathcal{Y})$. Then, it follows from $\iota(F_0)(A) \ge 0$,
$\iota(F_n)(A) \ge 0$ and $s \in [0,1]$ that $\mu_{n,s}(A) = \iota(F_0 + s t_n G_n)(A) = (1-s)\,\iota(F_0)(A) + s\,\iota(F_n)(A) \ge 0$. That
is, $\mu_{n,s} = \iota(F_0 + s t_n G_n)$ is a finite measure. Furthermore, it follows from
$F_0 \neq 0$, $F_n \neq 0$ and $s \in [0,1]$ that $\mu_{n,s} \neq 0$. According to the definitions,
this shows that $F_0 + s t_n G_n \in B_S$.

Fix any $n \in \mathbb{N}$. The function $s \mapsto S(F_0 + s t_n G_n)$ is continuous on $[0,1]$
according to (47) and Frechet-differentiable on $(0,1)$ according to Proposition
5.6; the derivative in $s \in (0,1)$ is given by $S'_{F_0 + s t_n G_n}(t_n G_n)$. Since the map
$h \mapsto \langle f, h\rangle_H$ is Frechet-differentiable for every $f \in H$, this implies that

$$(0,1) \to \mathbb{R}, \qquad s \mapsto \big\langle f,\, S(F_0 + s t_n G_n) \big\rangle_H$$

is differentiable for every $f \in H$; the derivative in $s \in (0,1)$ is given by
$\langle f, S'_{F_0 + s t_n G_n}(t_n G_n)\rangle_H$. Define $\bar h_n = h_n / \|h_n\|_H$. According to the
elementary mean value theorem, there is an $s_n \in (0,1)$ such that

$$\big\langle \bar h_n,\, S'_{F_0 + s_n t_n G_n}(t_n G_n) \big\rangle_H = \big\langle \bar h_n,\, S(F_0 + t_n G_n) \big\rangle_H - \big\langle \bar h_n,\, S(F_0) \big\rangle_H = \big\langle \bar h_n,\, S(F_0 + t_n G_n) - S(F_0) \big\rangle_H\,.$$

By use of the definition of $h_n$, this implies

$$\langle \bar h_n, h_n\rangle_H = \big\langle \bar h_n,\, S'_{F_0 + s_n t_n G_n}(t_n G_n) - t_n S'_{F_0}(G_0) \big\rangle_H$$

and, by use of the definition of $\bar h_n$, the latter equality and the Cauchy-
Schwarz inequality imply

$$\|h_n\|_H \le \big\| S'_{F_0 + s_n t_n G_n}(t_n G_n) - t_n S'_{F_0}(G_0) \big\|_H\,. \tag{58}$$

Then, (55) follows from

$$\Big\| \frac{S(F_0 + t_n G_n) - S(F_0)}{t_n} - S'_{F_0}(G_0) \Big\|_H = \frac{1}{t_n} \big\| S(F_0 + t_n G_n) - S(F_0) - t_n S'_{F_0}(G_0) \big\|_H = \frac{1}{t_n} \|h_n\|_H$$

$$\le \frac{1}{t_n} \big\| S'_{F_0 + s_n t_n G_n}(t_n G_n) - t_n S'_{F_0}(G_0) \big\|_H = \big\| S'_{F_0 + s_n t_n G_n}(G_n) - S'_{F_0}(G_0) \big\|_H\,,$$

where the inequality is (58), because the last expression converges to $0$ according to Lemma 5.7. □
5.4 Donsker-Classes and Application of the Delta-Method

It is well-known that

$$\sqrt{n}\,(\mathbb{F}_n - F) \rightsquigarrow \mathbb{G}_1 \quad \text{in } \ell_\infty(\mathcal{G}_1)$$

where $\mathbb{F}_n$ denotes the empirical process, $F$ denotes the distribution function
of $P$, $\mathbb{G}_1$ is a Gaussian process, and $\mathcal{G}_1$ is the set of all indicator
functions. However, as already noted in Subsection 5.1, the set of indicator
functions had to be enlarged to a set $\mathcal{G} \supset \mathcal{G}_1$ in order to ensure Hadamard-
differentiability of the SVM-functional

$$S : B_S \to H$$

in a neighborhood of $F \in B_S \subset \ell_\infty(\mathcal{G})$. Therefore, it still has to be proven
that weak convergence not only holds in $\ell_\infty(\mathcal{G}_1)$ but also in $\ell_\infty(\mathcal{G})$. This is
done in the following Lemma 5.9. After that, the main results can be proven
by applications of a functional delta-method.

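Lemma 5.9 For every $D_n = ((x_1,y_1),\dots,(x_n,y_n)) \in (\mathcal{X}\times\mathcal{Y})^n$, let $\mathbb{F}_{D_n}$
denote the element of $\ell_\infty(\mathcal{G})$ which corresponds to the empirical measure
$\mathbb{P}_{D_n}$. That is, $\mathbb{F}_{D_n}(g) = \int g\,d\mathbb{P}_{D_n} = \frac{1}{n}\sum_{i=1}^n g(x_i,y_i)$ for every $g \in \mathcal{G}$.
Then,

$$\sqrt{n}\,\big(\mathbb{F}_{D_n} - \iota^{-1}(P)\big) \rightsquigarrow \mathbb{G} \quad \text{in } \ell_\infty(\mathcal{G})$$

where $\mathbb{G} : \Omega \to \ell_\infty(\mathcal{G})$ is a tight Borel-measurable Gaussian process such
that $\mathbb{G}(\omega) \in B_0$ for every $\omega \in \Omega$.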
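To make the object $\mathbb{F}_{D_n}$ concrete, the following sketch evaluates it at an indicator function $g = I_{(-\infty,z]} \in \mathcal{G}_1$ for a small, made-up data set. It is only an illustration: the sample values and the helper names `empirical_functional` and `indicator` are hypothetical and do not come from the paper.

```python
def empirical_functional(sample, g):
    """F_{D_n}(g) = (1/n) * sum_i g(x_i, y_i) for a list of pairs (x_i, y_i)."""
    return sum(g(x, y) for x, y in sample) / len(sample)


def indicator(z):
    """Indicator function of (-inf, z] in R^(d+1), evaluated at (x, y)."""
    def g(x, y):
        point = tuple(x) + (y,)
        return 1.0 if all(p <= q for p, q in zip(point, z)) else 0.0
    return g


# Made-up data set D_4 with d = 1: inputs are 1-tuples, outputs are floats.
sample = [((0.1,), 0.5), ((0.4,), -0.2), ((0.9,), 1.3), ((0.3,), 0.0)]

# Fraction of sample points with x <= 0.5 and y <= 0.6.
value = empirical_functional(sample, indicator((0.5, 0.6)))
print(value)
```

For indicator functions this is the usual empirical distribution function; the lemma concerns the same map evaluated on the larger class $\mathcal{G}$.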

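Proof: In other words, we have to show that $\mathcal{G}$ is a $P$-Donsker class.

Part 1: Fix any $c \in (0,\infty)$. In Part 1 of the proof, it will be shown that

$$\mathcal{F}_c := \{ f \in H \mid \|f\|_H \le c \}$$

has a finite uniform entropy integral. Since $\mathcal{X} \subset \mathbb{R}^d$ is bounded, there is an
$r > 0$ such that $\mathcal{X} \subset \{ x \in \mathbb{R}^d \mid \|x\|_{\mathbb{R}^d} < r \} =: \tilde{\mathcal{X}}$. Then, $\tilde{\mathcal{X}}$ is a convex,
bounded subset of $\mathbb{R}^d$ with non-empty interior. Let $\tilde H$ be the RKHS of the
restriction of the kernel $k$ on $\tilde{\mathcal{X}} \times \tilde{\mathcal{X}}$ and define

$$\tilde{\mathcal{F}}_c := \{ \tilde f \in \tilde H \mid \|\tilde f\|_{\tilde H} \le c \}\,.$$

It follows from (Berlinet and Thomas-Agnan, 2004, Theorem 4.2.6) that

$$\mathcal{F}_c = \{ f \in H \mid f \text{ is the restriction of some } \tilde f \in \tilde{\mathcal{F}}_c \}\,. \tag{59}$$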
According to (van der Vaart and Wellner, 1996, p. 154), let $C_1^m(\tilde{\mathcal{X}})$ denote
the set of all functions $f : \tilde{\mathcal{X}} \to \mathbb{R}$ which have uniformly bounded partial
derivatives up to order $m-1$ and whose partial derivatives of order $m-1$
are Lipschitz-continuous such that

llfll \atff \\ a. \d a f(x)-d a f(x>)\ 
f , := max sup o fur + max sup 1 - < 1 . 



a|<m-l *te<* 



\a =m — l 



x,x'£X 
x^x' 



x — X' 



\~R. d 



It follows from convexity of X and the mean value theorem that 
\d a f(x) - d a f(x') 



max sup 



\a. | =m — 1 



F ~~ 33 llR d 



< max sup 9 Q f(x) . 



X^X 



I a | — m 



Hence, it follows from ( Steinwart and Christmann , 20081 . Corollary 4.36) 
that, for every / € T c , 

ll/IL < max sup \d a f(x)\ < \\f\\ & max sup (d a ' a k(x, x)) 



a | <m 





| a | <m 



< c- max sup (d a ' a k(x, x)) =: a c 6 (0,oo) 



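That is, $\frac{1}{a_c}\tilde{\mathcal{F}}_c \subset C_1^m(\tilde{\mathcal{X}})$ and, therefore, it follows from (van der Vaart and Wellner,
1996, Theorem 2.7.1) that there is a constant $r \in (0,\infty)$ such that, for every
$\varepsilon > 0$,

$$\ln N\big(a_c\varepsilon,\ \tilde{\mathcal{F}}_c,\ \|\cdot\|_\infty\big) = \ln N\big(\varepsilon,\ \tfrac{1}{a_c}\tilde{\mathcal{F}}_c,\ \|\cdot\|_\infty\big) \le r\,\Big(\frac{1}{\varepsilon}\Big)^{d/m}\,. \tag{60}$$

Here and in the following, $N(\cdot,\cdot,\cdot)$ denotes the covering number and $N_{[\,]}(\cdot,\cdot,\cdot)$
denotes the bracketing number; see e.g. (van der Vaart and Wellner, 1996,
§ 2.1.1). According to (59), $\mathcal{F}_c$ is the set of restrictions of the elements of $\tilde{\mathcal{F}}_c$
on $\mathcal{X}$. By use of this fact, it is easy to see that

$$\ln N\big(\varepsilon,\ \mathcal{F}_c,\ \|\cdot\|_\infty\big) \le \ln N\big(\varepsilon,\ \tilde{\mathcal{F}}_c,\ \|\cdot\|_\infty\big)$$

for every $\varepsilon > 0$. Therefore, it follows from (60) that

$$\ln N\big(\varepsilon,\ \mathcal{F}_c,\ \|\cdot\|_\infty\big) \le r \cdot a_c^{d/m}\,\Big(\frac{1}{\varepsilon}\Big)^{d/m} \qquad \forall\, \varepsilon > 0\,. \tag{61}$$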
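Now, choose the constant $f_c = \|k\|_\infty c + 1$ as an envelope of $\mathcal{F}_c$. Every
element $f \in \mathcal{F}_c$ can be identified with a function $\mathcal{X}\times\mathcal{Y} \to \mathbb{R}$ via $f(x,y) =
f(x)$. For every probability measure $P$ on $(\mathcal{X}\times\mathcal{Y}, \mathfrak{B}(\mathcal{X}\times\mathcal{Y}))$, we obtain

$$\|f\|_{L_2(P)} \le \sup_{(x,y)\in\mathcal{X}\times\mathcal{Y}} |f(x,y)| = \sup_{x\in\mathcal{X}} |f(x)| = \|f\|_\infty\,.$$

Therefore, it follows from (61) that

$$\sup_P\ \ln N\big(\varepsilon\|f_c\|_{L_2(P)},\ \mathcal{F}_c,\ \|\cdot\|_{L_2(P)}\big) \le r\,\Big(\frac{a_c}{\|k\|_\infty c + 1}\Big)^{d/m} \Big(\frac{1}{\varepsilon}\Big)^{d/m} \tag{62}$$

where the supremum is taken over all probability measures $P$ on $(\mathcal{X}\times\mathcal{Y},
\mathfrak{B}(\mathcal{X}\times\mathcal{Y}))$. Since $m > \frac{d}{2}$ by assumption, the function class $\mathcal{F}_c$ has a
finite uniform entropy integral. That is,

$$\int_{(0,1)} \sqrt{\sup_P\ \ln N\big(\varepsilon\|f_c\|_{L_2(P)},\ \mathcal{F}_c,\ \|\cdot\|_{L_2(P)}\big)}\ \lambda(d\varepsilon) < \infty\,.$$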
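Part 2: Now, it will be shown that

$$\mathcal G' := \big\{ L'_f : (x,y) \mapsto L'(x,y,f(x)) \;\big|\; f \in \mathcal F_{c_0} \big\}$$

also has a finite uniform entropy integral. Since

$$\sup_{x \in \mathcal X} |f(x)| \le \|k\|_\infty c_0 =: a \qquad \forall\, f \in \mathcal F_{c_0} ,$$

the assumptions imply that $g' := b''_a + b'_a$ is an envelope function of $\mathcal G'$ such
that $0 \le b''_a \le g'$ and, for every $(x,y) \in \mathcal X \times \mathcal Y$ and every $f_1, f_2 \in \mathcal F_{c_0}$,

$$\big|L'_{f_1}(x,y) - L'_{f_2}(x,y)\big| \overset{(*)}{\le} b''_a \,|f_1(x) - f_2(x)| \le g'(x,y)\, \|f_1 - f_2\|_\infty \qquad (63)$$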

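where (*) follows from the assumptions on $L''$ and the elementary mean value
theorem. For every probability measure $P$ on $(\mathcal X \times \mathcal Y, \mathfrak B(\mathcal X \times \mathcal Y))$ such that
$0 < \int (g')^2 \, dP < \infty$, it follows from (63) and (van der Vaart and Wellner,
1996, p. 84 and Theorem 2.7.11) that, for every $\varepsilon > 0$,

$$\ln N\big(\varepsilon \|g'\|_{L_2(P)}, \mathcal G', \|\cdot\|_{L_2(P)}\big) \le \ln N_{[\,]}\big(2\varepsilon \|g'\|_{L_2(P)}, \mathcal G', \|\cdot\|_{L_2(P)}\big) \le \ln N\big(\varepsilon, \mathcal F_{c_0}, \|\cdot\|_\infty\big) \le r \cdot a_{c_0}^{d/m} \Big(\frac{1}{\varepsilon}\Big)^{d/m} ,$$

where the last inequality follows from (61).
Hence, the assumption $m > d/2$ implies that $\mathcal G'$ has a finite uniform entropy
integral.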

Part 3: Now, it will be shown that $\mathcal G$ is a $P$-Donsker class. Trivially, $\{b\}$ is a
$P$-Donsker class because $b \in L_2(P)$ by assumption. From (van der Vaart and Wellner,
1996, Example 2.5.4) it follows that $\mathcal G_1$ is $P$-Donsker. Note that $\mathcal G_2 = \mathcal G' \cdot \mathcal F_c$
for $c = 1$. According to Part 1, the class $\mathcal F_c$ has a finite uniform entropy
integral relative to the (constant) envelope $f_c$ and, according to Part 2,
the class $\mathcal G'$ has a finite uniform entropy integral relative to the envelope
$g'$. Therefore, it follows from (van der Vaart, 1998, Example 19.19) that
$\mathcal G_2 = \mathcal G' \cdot \mathcal F_c$ has a finite uniform entropy integral relative to the envelope
$f_c g'$. The definitions and assumptions imply $\int (f_c g')^2 \, dP < \infty$.
Hence, it follows from (van der Vaart, 1998, Theorem 19.14) that $\mathcal G_2$ is a
$P$-Donsker class provided that $\mathcal G_2$ is "suitably measurable". According to
(van der Vaart, 1998, p. 274), it suffices to show that there is a countable



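subset $\hat{\mathcal G}_2 \subset \mathcal G_2$ such that, for every $g \in \mathcal G_2$, there is a sequence $(g_n)_{n \in \mathbb N} \subset \hat{\mathcal G}_2$
which converges pointwise to $g$. According to (Steinwart and Christmann,
2008, Lemma 4.33), $H$ is a separable Hilbert space and, therefore, the subsets
$\mathcal F_c \subset H$ are also separable for $c = 1$ and $c = c_0$. That is, there are
countable subsets $\hat{\mathcal F}_1 \subset \mathcal F_1$ and $\hat{\mathcal F}_{c_0} \subset \mathcal F_{c_0}$ which are dense in $\mathcal F_1$ and $\mathcal F_{c_0}$
respectively (with respect to the norm topology). Then,

$$\hat{\mathcal G}_2 := \big\{ L'_{f_0} f_1 \;\big|\; f_0 \in \hat{\mathcal F}_{c_0},\; f_1 \in \hat{\mathcal F}_1 \big\}$$

is again countable. Fix any $g \in \mathcal G_2$. That is, there are $f_0 \in \mathcal F_{c_0}$ and $f_1 \in \mathcal F_1$
such that $g = L'_{f_0} f_1$. Furthermore, there are sequences $(f_0^{(n)})_{n \in \mathbb N} \subset \hat{\mathcal F}_{c_0}$
and $(f_1^{(n)})_{n \in \mathbb N} \subset \hat{\mathcal F}_1$ such that

$$\lim_{n \to \infty} \big\|f_0^{(n)} - f_0\big\|_H = 0 \qquad \text{and} \qquad \lim_{n \to \infty} \big\|f_1^{(n)} - f_1\big\|_H = 0 .$$

Next, define $g_n := L'_{f_0^{(n)}} f_1^{(n)} \in \hat{\mathcal G}_2$ for every $n \in \mathbb N$. Since $H$ is a reproducing
kernel Hilbert space, norm convergence implies pointwise convergence so
that, for every $(x, y) \in \mathcal X \times \mathcal Y$,

$$\lim_{n \to \infty} g_n(x,y) = \lim_{n \to \infty} L'\big(x, y, f_0^{(n)}(x)\big)\, f_1^{(n)}(x) = L'\big(x, y, f_0(x)\big)\, f_1(x) = g(x,y)$$

due to continuity of $L'$.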
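The fact that norm convergence in a reproducing kernel Hilbert space implies pointwise convergence rests on the inequality $|f(x)| \le \|f\|_H \sqrt{k(x,x)}$, which follows from the reproducing property and the Cauchy-Schwarz inequality. The following is a minimal numerical sketch of this inequality, not part of the proof. The Gaussian kernel, the one-dimensional inputs, the centers and the coefficient vectors are illustrative assumptions, and all function names are hypothetical.

```python
import numpy as np

def gauss_kernel(a, b, gamma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-gamma * (a_i - b_j)^2), 1-d inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-gamma * (a - b) ** 2)

def rkhs_norm(alpha, centers):
    """H-norm of f = sum_i alpha_i k(., centers_i), i.e. sqrt(alpha^T K alpha)."""
    K = gauss_kernel(centers, centers)
    # Guard against tiny negative values caused by rounding.
    return float(np.sqrt(max(alpha @ K @ alpha, 0.0)))

def evaluate(alpha, centers, x):
    """Pointwise values f(x) of f = sum_i alpha_i k(., centers_i)."""
    return gauss_kernel(x, centers) @ alpha

def pointwise_gap_and_bound(alpha_f, alpha_g, centers, x):
    """Return |f(x) - g(x)| and the bound ||f - g||_H * sqrt(k(x, x))."""
    diff = alpha_f - alpha_g
    gap = np.abs(evaluate(diff, centers, x))
    bound = rkhs_norm(diff, centers) * np.sqrt(np.diag(gauss_kernel(x, x)))
    return gap, bound

# Illustrative setup (assumption): g_n approaches f in H-norm as n grows.
rng = np.random.default_rng(0)
centers = np.linspace(-2.0, 2.0, 8)
alpha_f = rng.normal(size=8)
x_grid = np.linspace(-3.0, 3.0, 50)
for n in (1, 10, 100):
    alpha_g = alpha_f + rng.normal(size=8) / n
    gap, bound = pointwise_gap_and_bound(alpha_f, alpha_g, centers, x_grid)
    print(n, gap.max(), bound.max())
```

In this sketch the largest pointwise gap stays below the norm bound for each `n`, so a sequence that converges in $\|\cdot\|_H$ also converges at every point $x$.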

Part 4: As $\mathcal G$ is assured to be a $P$-Donsker class, we have

$$\sqrt{n}\,\big(F_{D_n} - \iota^{-1}(P)\big) \rightsquigarrow \mathbb G \quad \text{in } \ell_\infty(\mathcal G)$$

where $\mathbb G : \Omega \to \ell_\infty(\mathcal G)$ is a tight Borel-measurable Gaussian process. Since
$\sqrt{n}\,\big(F_{D_n(\omega)} - \iota^{-1}(P)\big) \in B_0$ for every $\omega \in \Omega$ and every $n \in \mathbb N$, it follows from
closedness of $B_0$ and the Portmanteau theorem (van der Vaart and Wellner,
1996, Theorem 1.3.4(iii)) that $\mathbb G(\omega) \in B_0$ almost surely. Hence, we may assume
without loss of generality that $\mathbb G(\omega) \in B_0$ for every $\omega \in \Omega$. (Otherwise,
replace $\mathbb G$ by $\mathbb G \cdot (I_{B_0} \circ \mathbb G)$.) □

For ease of reference, the following lemma summarizes some facts about
Bochner-integrals of tight Gaussian processes in a space $\ell_\infty(T)$. Later on,
these facts are needed in order to prove that the Gaussian process $\mathbb H : \Omega \to H$
is zero-mean.

Lemma 5.10 Let $T$ be any set, $\ell_\infty(T)$ the set of all bounded functions
$h : T \to \mathbb R$ (endowed with the supremum-norm) and $\mathbb G : \Omega \to \ell_\infty(T)$ a tight
Borel-measurable Gaussian process such that

$$\int \mathbb G(\omega)(t) \, Q(d\omega) = 0 \qquad \forall\, t \in T . \qquad (64)$$

Then, the Bochner-integral of $\mathbb G : \Omega \to \ell_\infty(T)$ exists and $\int \mathbb G(\omega) \, Q(d\omega) = 0$.
Furthermore, $\int A(\mathbb G) \, dQ = 0$ for every Banach space $E$ and every continuous
linear operator $A : \ell_\infty(T) \to E$.

Proof: Since $\mathbb G$ is tight, it is also separable so that there is a separable
subset $\mathcal T \subset \ell_\infty(T)$ such that $Q(\mathbb G \in \mathcal T) = 1$; see (van der Vaart and Wellner,
1996, p. 16f). As the closed linear span of a separable subset of a Banach
space is again separable (Schechter, 2004, Lemma A.48), we may assume
without loss of generality that $\mathcal T$ is a separable Banach space. Define $\tilde{\mathbb G} =
\mathbb G \cdot (I_{\mathcal T} \circ \mathbb G)$. Then, $\tilde{\mathbb G} : \Omega \to \mathcal T$ is a Borel-measurable map. Let $h^* : \mathcal T \to \mathbb R$
be a continuous linear functional. According to the Hahn-Banach theorem
(Dunford and Schwartz, 1958, Theorem II.3.11), $h^*$ can be extended to a
continuous linear functional $\tilde h^* : \ell_\infty(T) \to \mathbb R$. Since $\tilde h^*(\mathbb G)$ is normally
distributed according to (van der Vaart and Wellner, 1996, Lemma 3.9.8)
and $h^*(\tilde{\mathbb G}) = \tilde h^*(\mathbb G)$ $Q$-a.s., the real random variable $h^*(\tilde{\mathbb G})$ is normally
distributed. This proves that the Borel-measurable map $\tilde{\mathbb G} : \Omega \to \mathcal T$ is a
Gaussian process in the separable Banach space $\mathcal T$. Hence, it follows from
Sato (1971) that $\int \|\tilde{\mathbb G}\| \, dQ < \infty$ and, therefore,

$$\int \|\mathbb G\| \, dQ < \infty . \qquad (65)$$

(Fernique (1970) proves a related statement for centered Gaussian processes
but we still have to prove that $\mathbb G$ is centered and this will be done by
use of (65) so that we cannot use Fernique's theorem here.) According
to (Denkowski et al., 2003, Theorem 3.10.3 and Theorem 3.10.9), (65) is
equivalent to the existence of the Bochner-integral $\int \mathbb G \, dQ$.

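Note that, for every $t \in T$, the map $\pi_t : \ell_\infty(T) \to \mathbb R$, $h \mapsto h(t)$ is a continuous
linear operator. Then, by use of the fact that the Bochner-integral may
be interchanged with continuous linear operators (Denkowski et al., 2003,
Theorem 3.10.16 and Remark 3.10.17), we get

$$\Big(\int \mathbb G(\omega) \, Q(d\omega)\Big)(t) = \pi_t\Big(\int \mathbb G(\omega) \, Q(d\omega)\Big) = \int \pi_t\big(\mathbb G(\omega)\big) \, Q(d\omega) = \int \mathbb G(\omega)(t) \, Q(d\omega) = 0$$

for every $t \in T$, where the last equality is assumption (64). That is, $\int \mathbb G \, dQ = 0$. Using again the fact that the
Bochner-integral may be interchanged with continuous linear operators, we
finally get $\int A(\mathbb G) \, dQ = A\big(\int \mathbb G \, dQ\big) = A(0) = 0$. □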

Proof of Theorem 3.1: First, it will be shown that

$$\Omega \to H, \qquad \omega \mapsto f_{L, D_n(\omega), \lambda_{D_n(\omega)}}$$

is Borel-measurable. According to the assumptions, it follows from (Steinwart and Christmann,
2008, Lemma 5.13 and Corollary 5.19) that $(\mathcal X \times \mathcal Y)^n \to H$, $D_n \mapsto f_{L, D_n, \lambda}$
is continuous for every constant $\lambda \in (0, \infty)$ and that $(0, \infty) \to H$,
$\lambda \mapsto f_{L, D_n, \lambda}$ is continuous for every $D_n \in (\mathcal X \times \mathcal Y)^n$. Hence, $(D_n, \lambda) \mapsto f_{L, D_n, \lambda}$
is a Carathéodory function and, therefore, measurable; see, e.g.,
(Denkowski et al., 2003, Theorem 2.5.22). Since $\omega \mapsto D_n(\omega)$ and $\omega \mapsto \lambda_{D_n(\omega)}$
are assumed to be measurable, the compound function $\omega \mapsto f_{L, D_n(\omega), \lambda_{D_n(\omega)}}$
is again measurable.



In order to apply the functional delta-method (van der Vaart and Wellner,
1996, Theorem 3.9.4), note that $\ell_\infty(\mathcal G)$ and $H$ are Banach spaces. Recall
from Lemma 5.9 that $F_{D_n} : \Omega \to B_S$, $\omega \mapsto F_{D_n(\omega)}$ is the random map
where $F_{D_n(\omega)}$ is that element of $B_S$ which corresponds to the empirical
distribution of $D_n(\omega) = \big((X_1(\omega), Y_1(\omega)), \ldots, (X_n(\omega), Y_n(\omega))\big)$. That is,

1 n 



i=l 



Define



Fo^i-^P) and 



F 



D„ 
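Then, Lemma 5.9 yields

$$\sqrt{n}\,\big(F_{D_n} - \iota^{-1}(P)\big) \rightsquigarrow \mathbb G \quad \text{in } \ell_\infty(\mathcal G)$$

where $\mathbb G : \Omega \to \ell_\infty(\mathcal G)$ is a tight Borel-measurable Gaussian process which
takes its values in $B_0$. Furthermore,

$$\int \mathbb G(\omega)(g) \, Q(d\omega) = 0 \qquad \forall\, g \in \mathcal G ; \qquad (66)$$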



see (van der Vaart and Wellner, 1996, p. 81f). According to (van der Vaart and Wellner,
1996, p. 16f), $\mathbb G$ is also separable (which is important in order to apply
Slutsky's lemma for Banach space valued random maps below). Note that
$\sqrt{n}\,(\lambda_{D_n} - \lambda_0) \to 0$ in probability implies $\lambda_0 / \lambda_{D_n} \to 1$ and
$\sqrt{n}\,(\lambda_{D_n} - \lambda_0) / \lambda_{D_n} \to 0$ in probability; see e.g. (van der Vaart,
1998, Theorems 2.3 and 2.7(vi)). Hence, it follows from Slutsky's lemma
(van der Vaart and Wellner, 1996, p. 32) that



V^{U - Fq) = V^(F Dn - F Q ) ■ + 



\M a d„ - A ) 



G 



in $\ell_\infty(\mathcal G)$. Then, applying the delta-method (van der Vaart and Wellner,
1996, Theorem 3.9.4) yields

$$\sqrt{n}\,\big(f_{L, D_n, \lambda_{D_n}} - f_{L, P, \lambda_0}\big) = \sqrt{n}\,\big(S(\tilde F_n) - S(F_0)\big) \rightsquigarrow S'_{F_0}(\mathbb G) .$$

Since $S'_{F_0}$ is a continuous linear operator and $\mathbb G$ is a tight Borel-measurable
Gaussian process, $S'_{F_0}(\mathbb G)$ is Gaussian as well; see, e.g., (van der Vaart and Wellner,
1996, §3.9.2). Since $H$ is a complete and separable metric space, $S'_{F_0}(\mathbb G)$ is
tight; see e.g. (Dudley, 2002, Theorem 11.5.4).

It follows from (66) and Lemma 5.10 that $S'_{F_0}(\mathbb G)$ has mean zero. □
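The functional delta-method used in this proof generalizes the classical finite-dimensional delta-method: if $\sqrt{n}(\bar X_n - \mu)$ is asymptotically $N(0, \sigma^2)$ and $\varphi$ is differentiable at $\mu$, then $\sqrt{n}(\varphi(\bar X_n) - \varphi(\mu))$ is asymptotically $N(0, \varphi'(\mu)^2 \sigma^2)$. The following simulation is only a scalar sketch of this principle and does not reproduce the SVM setting. The exponential distribution, the map $\varphi(t) = t^2$, the sample sizes and the function name are illustrative assumptions.

```python
import numpy as np

def delta_method_sample(phi, n, reps, rng):
    """Draw `reps` realisations of sqrt(n) * (phi(mean of n Exp(1) draws) - phi(1)).

    Exp(1) has mean 1 and variance 1 (illustrative choice, not from the paper).
    """
    x_bar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    return np.sqrt(n) * (phi(x_bar) - phi(1.0))

rng = np.random.default_rng(0)
# phi(t) = t^2 has derivative 2 at t = 1, so the limit variance is 2^2 * 1 = 4.
z = delta_method_sample(lambda t: t ** 2, n=1000, reps=2000, rng=rng)
print("empirical mean:", z.mean())
print("empirical variance:", z.var(), "delta-method variance:", 4.0)
```

In the proof above, the role of $\varphi$ is played by the SVM-functional $S$, the role of $\varphi'(\mu)$ by the Hadamard-derivative $S'_{F_0}$, and the normal limit is replaced by the Gaussian process $\mathbb G$.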



Proof of Theorem 3.2: It follows from Lemma 5.4 that the risk functional
$\mathcal R_{L,P}$ is Hadamard-differentiable in $H$ tangentially to $H$; the derivative of
$\mathcal R_{L,P}$ in $f \in H$ is the continuous linear operator

$$\mathcal R'_{L,P;f} : H \to \mathbb R, \qquad h \mapsto \Big\langle \int L'_f \Phi \, dP, \, h \Big\rangle_H .$$

According to Theorem 3.1, $\sqrt{n}\,\big(f_{L, D_n, \lambda_{D_n}} - f_{L, P, \lambda_0}\big) \rightsquigarrow \mathbb H$ where $\mathbb H : \Omega \to H$
is a tight Borel-measurable Gaussian process which has zero-mean and does
not depend on $\lambda_{D_n}$ but only on $\lambda_0$. Then, it follows from the delta-method
(van der Vaart and Wellner, 1996, Theorem 3.9.4) that

$$\sqrt{n}\,\big(\mathcal R_{L,P}(f_{L, D_n, \lambda_{D_n}}) - \mathcal R_{L,P}(f_{L, P, \lambda_0})\big) \rightsquigarrow \mathcal R'_{L,P;f_{L,P,\lambda_0}}(\mathbb H) .$$

Since $\mathcal R'_{L,P;f_{L,P,\lambda_0}}$ is a continuous linear operator, and $\mathbb H$ is Gaussian, the
(real valued) random variable $\mathcal R'_{L,P;f_{L,P,\lambda_0}}(\mathbb H)$ is normally distributed; see
e.g. (van der Vaart and Wellner, 1996, §3.9.2). Therefore, it only remains
to prove that the mean of $\mathcal R'_{L,P;f_{L,P,\lambda_0}}(\mathbb H)$ is equal to 0. This follows from

$$\mathbb E\, \mathcal R'_{L,P;f_{L,P,\lambda_0}}(\mathbb H) = \mathbb E \Big\langle \int L'_{f_{L,P,\lambda_0}} \Phi \, dP, \, \mathbb H \Big\rangle_H = 0$$

as $\mathbb H : \Omega \to H$ has zero-mean. □

Proposition 5.11 Under the assumptions of Theorem 3.1, the Gaussian
process

$$\mathbb H : \Omega \to H, \qquad \omega \mapsto \mathbb H(\omega)$$

is degenerated to 0 if and only if, for every $h \in H$, there is a constant
$c_h \in \mathbb R$ such that

$$L'\big(x, y, f_{L,P,\lambda_0}(x)\big)\, h(x) = c_h \qquad \text{for } P\text{-a.e. } (x, y) \in \mathcal X \times \mathcal Y . \qquad (67)$$

Proof: According to the proof of Theorem 3.1, the Gaussian process $\mathbb H$ is
equal to $S'_{F_0}(\mathbb G)$, and $S'_{F_0}(\mathbb G)$ is equal to 0 if and only
if $\mathbb G(L'_{f_{L,P,\lambda_0}} h)$ is equal to 0 for every $h \in H$ such that $\|h\|_H \le 1$. As
shown in Lemma 5.9, the class of functions $\mathcal G$ is a $P$-Donsker class and,
accordingly, the distribution of the marginals $\mathbb G(L'_{f_{L,P,\lambda_0}} h)$ of the limit of
$\sqrt{n}\,\big(F_{D_n} - \iota^{-1}(P)\big) \rightsquigarrow \mathbb G$ in $\ell_\infty(\mathcal G)$ is equal to $\mathcal N(0, \sigma_h^2)$ where

$$\sigma_h^2 = \int \Big( L'_{f_{L,P,\lambda_0}} h - \int L'_{f_{L,P,\lambda_0}} h \, dP \Big)^2 dP ;$$

see e.g. (van der Vaart and Wellner, 1996, §2.1). That is, $\mathbb H = 0$ almost
surely if and only if $\sigma_h^2 = 0$ for every $h \in H$. □
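The link between a vanishing variance and condition (67) can be spelled out as a short worked step. The abbreviation $g_h(x,y) := L'(x, y, f_{L,P,\lambda_0}(x))\, h(x)$ is shorthand introduced here. The variance of the marginal is the variance of $g_h$ under $P$, so that

$$\int \Big( g_h - \int g_h \, dP \Big)^2 dP = 0 \quad \Longleftrightarrow \quad g_h(x,y) = \int g_h \, dP \quad \text{for } P\text{-a.e. } (x,y) .$$

Hence, a vanishing variance gives (67) with $c_h = \int g_h \, dP$. Conversely, if $g_h = c_h$ holds $P$-almost everywhere for some constant $c_h$, then $\int g_h \, dP = c_h$ and the variance is 0.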